# Pandas Guide


### Table of Contents 
1. Understanding Pandas
    - Data Viewing 
    - Understanding the components of the DF 
        - Series
        - Index 
2. Importing/Converting Data into Pandas
    - Importing Data 
        - Excel 
        - Stata File
        - SAS File
        - JSON File
        - Online File
    - Converting Data from already existing data structure 
        - List
        - Dictionary 
        - NumPy Array
3. Data Manipulation 
    - Creating Variables
    - Managing Variables
    - Managing Missing Values
    - Check duplicates
    - Converting data formats
    - Cleaning Dates
    - String Commands
    - Renaming Variables (Mass renaming)
    - Reshaping Data
    - Merging/Appending Data
4. Regression Analysis 
    - Statsmodels
5. Outputs from Pandas 
    - Tabouts 
    - Export Summary Stats
    - Tables from Regressions 
    - Graphs and So On
    
Resources: 
1. https://plot.ly/python/
2. https://nbviewer.jupyter.org/github/QuantEcon/QuantEcon.notebooks/blob/master/pandas_and_matplotlib.ipynb
3. https://cheatsheets.quantecon.org/stats-cheatsheet.html
4. https://nbviewer.jupyter.org/github/QuantEcon/QuantEcon.notebooks/blob/master/sci_python_quickstart.ipynb
5. https://cheatsheets.quantecon.org/
6. https://seaborn.pydata.org/ 
7. https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/ 
8. https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/
9. https://paulovasconcellos.com.br/28-useful-pandas-functions-you-might-not-know-de42c59db085
10. https://realpython.com/python-pandas-tricks/
11. https://dataconomy.com/2015/03/14-best-python-pandas-features/


## 1. Understanding Pandas

### What is Pandas and why should I learn it?


Pandas is a popular package in Python for data science. Pandas offers flexible data structures that make data manipulation and analysis intuitive for researchers. 

Most, if not all, data analysis done by social science researchers in Python will utilize Pandas, and for good reason: the Pandas DataFrame (the signature object offered by the package) mirrors the data frame in R and the dataset in Stata.

Using Python, and by extension most likely Pandas, for data analysis offers multiple benefits:

    1. You will be able to use a host of Python packages for text analysis
    2. You will be able to turn data mining projects into usable CSV files
    3. Data visualization
    4. Machine learning implementation
    5. Clean code with readable syntax
    6. Jupyter notebooks for sharing data and process
    

### Series
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call `pd.Series(data, index=index)`. Selecting a single column of a DataFrame also returns a Series:

In [114]:
df['Year'].head()

0    1968
1    1968
2    1968
3    1968
4    1968
Name: Year, dtype: int64

When calling `pd.Series(data, index=index)`, `data` can be many different things:

- a Python dict
- an ndarray
- a scalar value (like 5)

The passed `index` is a list of axis labels.
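
As a minimal sketch of the three input kinds listed above, the cell below builds one Series from each. The variable names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels
s_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From an ndarray, with an explicit list of axis labels as the index
s_array = pd.Series(np.array([10.0, 20.0, 30.0]), index=["x", "y", "z"])

# From a scalar: the value is repeated for every label in the index
s_scalar = pd.Series(5, index=["a", "b", "c"])

print(s_dict)
print(s_array)
print(s_scalar)
```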

### DataFrame 
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.
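
To make the point about passing an index concrete, here is a small sketch with a dict of two Series. The labels and values are invented for the example. Without an index, the row labels are built from the data; with one, only the requested labels survive.

```python
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

# No index passed: the row labels are the union of the two Series indexes,
# and "one" gets NaN for the label "d" it does not have
df_all = pd.DataFrame(d)

# Index passed: rows "a" and "c" are discarded, and the label "e",
# which is absent from the data, is filled with NaN
df_sub = pd.DataFrame(d, index=["d", "b", "e"])

print(df_all)
print(df_sub)
```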

In [115]:
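import pandas as pd

# Build a DataFrame from a dict of equal-length lists; the keys become column names
d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}
d = pd.DataFrame(d)

print(d)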

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


In [116]:
d

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


## 2. Importing Data

### Importing data

In [117]:
# Importing data from Stata 
df = pd.read_stata(r'.\State_ETO.dta')  # raw string so the backslash is not read as an escape

In [118]:
# Importing data from Excel into pandas
df = pd.read_excel('Minimum Wage Data.xlsx')

In [119]:
# Similar readers exist for each common data type (pd.read_csv, pd.read_sas, pd.read_json, and so on)
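
As a sketch of two of those other readers, the cell below parses CSV and JSON text. To keep it runnable without any files on disk, the text is held in memory with `io.StringIO`; the column names and rows are made up for the example. In practice you would pass a file path or URL instead.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for real files
csv_text = "state,year,wage\nAlaska,1968,2.10\nCalifornia,1968,1.65\n"
json_text = '[{"state": "Alaska", "wage": 2.10}, {"state": "California", "wage": 1.65}]'

# read_csv and read_json accept a path, a URL, or a file-like object
wages_csv = pd.read_csv(io.StringIO(csv_text))
wages_json = pd.read_json(io.StringIO(json_text))

print(wages_csv)
print(wages_json)
```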

### Converting Data

In [120]:
# List
import numpy as np
my_list = [1,2,3,4,5,6,7,8,9]
# Reshape the flat list into 3 rows x 3 columns and label the columns a, b, c
pd.DataFrame(np.array(my_list).reshape(3,3), columns = list("abc"))


   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9


In [121]:
# Dictionary
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)

   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d


In [122]:
# NumPy array (structured array: each field becomes a column)
dtype = [('Col1','int32'), ('Col2','float32'), ('Col3','float32')]
values = np.zeros(20, dtype=dtype)
index = ['Row'+str(i) for i in range(1, len(values)+1)]

pd.DataFrame(values, index=index).head()

      Col1  Col2  Col3
Row1     0   0.0   0.0
Row2     0   0.0   0.0
Row3     0   0.0   0.0
Row4     0   0.0   0.0
Row5     0   0.0   0.0


## 3. Data Manipulation

In [123]:
# Creating a variable
print(df.head())

   Year       State           Table_Data  High.Value  Low.Value  CPI.Average  \
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333   
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333   
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333   
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333   
4  1968  California              1.65(b)     1.65000    1.65000    34.783333   

   High 2018  Low.2018  
0       0.00      0.00  
1      15.12     15.12  
2       4.75      3.37  
3       1.12      1.12  
4      11.88     11.88  


In [124]:
df['7'] = 7
print(df.head())

   Year       State           Table_Data  High.Value  Low.Value  CPI.Average  \
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333   
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333   
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333   
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333   
4  1968  California              1.65(b)     1.65000    1.65000    34.783333   

   High 2018  Low.2018  7  
0       0.00      0.00  7  
1      15.12     15.12  7  
2       4.75      3.37  7  
3       1.12      1.12  7  
4      11.88     11.88  7  


In [125]:
df['Difference'] = df['High.Value'] - df['Low.Value']
df['High*2'] = df['High.Value']*2
df['Low*2'] = df['Low.Value']*2

In [126]:
print(df.head())

   Year       State           Table_Data  High.Value  Low.Value  CPI.Average  \
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333   
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333   
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333   
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333   
4  1968  California              1.65(b)     1.65000    1.65000    34.783333   

   High 2018  Low.2018  7  Difference  High*2   Low*2  
0       0.00      0.00  7       0.000  0.0000  0.0000  
1      15.12     15.12  7       0.000  4.2000  4.2000  
2       4.75      3.37  7       0.192  1.3200  0.9360  
3       1.12      1.12  7       0.000  0.3125  0.3125  
4      11.88     11.88  7       0.000  3.3000  3.3000  


In [129]:
# Managing Missing Values

# Missing values are stored as NaN
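
A minimal sketch of the usual first steps with NaN values: count them, drop them, or fill them. The small `wages` frame below is invented for the example, and filling with the column mean is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing wage
wages = pd.DataFrame({
    "state": ["A", "B", "C"],
    "wage": [np.nan, 2.10, 0.66],
})

# Count missing values in a column
n_missing = wages["wage"].isna().sum()

# Drop the rows where wage is missing
dropped = wages.dropna(subset=["wage"])

# Fill missing wages with the column mean (NaN is skipped when averaging)
filled = wages.fillna({"wage": wages["wage"].mean()})

print(n_missing)
print(dropped)
print(filled)
```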

In [130]:
# Check duplicates
df[df.duplicated(keep=False)]

Empty DataFrame
Columns: [Year, State, Table_Data, High.Value, Low.Value, CPI.Average, High 2018, Low.2018, 7, Difference, High*2, Low*2]
Index: []


In [96]:
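# Cleaning Dates
import pandas as pd

# Example data: dates stored as day/month/year strings
df = pd.DataFrame({'DOB': {0: '26/1/2016', 1: '26/1/2016'}})
print(df)

# Parse the strings into datetime values; dayfirst=True reads them as day/month/year
df['DOB'] = pd.to_datetime(df['DOB'], dayfirst=True)
print(df)

# Format the dates back into month/day/year strings
df['DOB1'] = df['DOB'].dt.strftime('%m/%d/%Y')
print(df)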


In [97]:
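# String Commands
import numpy as np

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

s.str.lower()   # lowercase every string; NaN stays NaN

s.str.upper()   # uppercase every string

s.str.len()     # length of each string

# Clean column names: trim whitespace, lowercase, replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Illustrative data for splitting on a delimiter
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
s2.str.split('_')

# Be wary of special characters: '$' is a regex anchor, so pass regex=False
# to replace the literal dollar sign (illustrative data)
dollars = pd.Series(['12', '-$10', '$10,000'])
dollars.str.replace('$', '', regex=False)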

In [98]:
# Renaming Variables (Mass renaming)
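
One possible way to rename columns, sketched on an empty frame that borrows a few column names from the minimum wage data above. `rename` takes either a mapping for selected columns or a function applied to every name; the cleaned names chosen here are only an example convention.

```python
import pandas as pd

# Empty frame with a few of the column names used earlier
df_demo = pd.DataFrame(columns=["High.Value", "Low.Value", "High 2018", "Low.2018"])

# Rename selected columns with an old-name -> new-name mapping
renamed = df_demo.rename(columns={"High.Value": "high_value", "Low.Value": "low_value"})

# Mass renaming: apply one function to every column name
mass = df_demo.rename(columns=lambda c: c.lower().replace(".", "_").replace(" ", "_"))

print(list(renamed.columns))
print(list(mass.columns))
```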

In [99]:
# Reshaping Data
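
A small sketch of going from wide to long and back, roughly what `reshape long` and `reshape wide` do in Stata. The frame and its column names are made up; `melt` stacks the year columns into rows and `pivot` spreads them out again.

```python
import pandas as pd

# Hypothetical wide data: one row per state, one column per year
wide = pd.DataFrame({
    "state": ["A", "B"],
    "wage_2017": [1.0, 2.0],
    "wage_2018": [1.5, 2.5],
})

# Wide -> long: one row per state-year pair
long = wide.melt(id_vars="state", var_name="year", value_name="wage")

# Turn "wage_2017" into the integer 2017
long["year"] = long["year"].str.replace("wage_", "", regex=False).astype(int)

# Long -> wide: states as rows, years as columns
back = long.pivot(index="state", columns="year", values="wage")

print(long)
print(back)
```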

In [100]:
# Merging/Appending Data
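
A sketch of both operations on invented frames. `merge` joins on a key column, and `indicator=True` adds a `_merge` column similar to Stata's. For appending rows, `pd.concat` is used here because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical data sets sharing the key column "state"
wages = pd.DataFrame({"state": ["A", "B", "C"], "wage": [1.0, 2.0, 3.0]})
regions = pd.DataFrame({"state": ["A", "B", "D"], "region": ["West", "South", "East"]})

# Merging: keep every row of wages, attach region where the state matches
merged = wages.merge(regions, on="state", how="left", indicator=True)

# Appending: stack rows with the same columns and renumber the index
more = pd.DataFrame({"state": ["E"], "wage": [4.0]})
appended = pd.concat([wages, more], ignore_index=True)

print(merged)
print(appended)
```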

In [101]:
# Regression Analysis 

In [102]:
#  Statsmodels

In [103]:
# Outputs from Pandas 

In [104]:
# Tabouts 

In [105]:
# Export Summary Stats

In [106]:
# Tables from Regressions 

In [107]:
# Graphs and So On

In [None]:
# Useful Functions

In [None]:
#1 – Boolean Indexing
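
A minimal sketch of Boolean indexing on a made-up frame: each comparison produces a True/False Series, and `.loc` keeps the rows where the combined mask is True.

```python
import pandas as pd

# Hypothetical data
wages = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "year": [1968, 1968, 2018, 2018],
    "wage": [0.0, 2.1, 7.25, 11.0],
})

# Combine conditions with & (and), | (or), ~ (not);
# each condition needs its own parentheses
mask = (wages["year"] == 2018) & (wages["wage"] > 10)

# Keep the matching rows and only the listed columns
high_2018 = wages.loc[mask, ["state", "wage"]]

print(high_2018)
```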

In [None]:
#2 – Apply Function

In [None]:
#3 – Imputing missing values

In [None]:
#4 – Pivot Table
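
A short sketch of `pivot_table` on invented data: it groups by the row and column keys and aggregates the values, here with the mean.

```python
import pandas as pd

# Hypothetical data
wages = pd.DataFrame({
    "region": ["West", "West", "South", "South"],
    "year": [2017, 2018, 2017, 2018],
    "wage": [10.0, 11.0, 7.0, 8.0],
})

# Average wage by region (rows) and year (columns)
table = wages.pivot_table(index="region", columns="year", values="wage", aggfunc="mean")

print(table)
```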

In [None]:
#5 – Multi-Indexing

In [None]:
#6 – Crosstab
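
A sketch of `pd.crosstab`, which is close to a two-way `tabulate` in Stata. The survey data is made up; `normalize="index"` turns the counts into row shares.

```python
import pandas as pd

# Hypothetical survey data
survey = pd.DataFrame({
    "region": ["West", "West", "South", "South", "South"],
    "employed": ["yes", "no", "yes", "yes", "no"],
})

# Counts of each region/employed combination
counts = pd.crosstab(survey["region"], survey["employed"])

# Row shares: each row sums to 1
shares = pd.crosstab(survey["region"], survey["employed"], normalize="index")

print(counts)
print(shares)
```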

In [None]:
#7 – Merge DataFrames

In [None]:
#8 – Sorting DataFrames

In [None]:
#9 – Plotting (Boxplot & Histogram)

In [None]:
#10 – Cut function for binning
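
A minimal sketch of binning with `pd.cut`. The bin edges and labels are arbitrary choices for the example; by default each bin includes its right edge, so the bins here are (0, 1], (1, 10] and (10, 20].

```python
import pandas as pd

# Hypothetical hourly wages
wage = pd.Series([0.5, 2.1, 7.25, 15.0])

# Three bins need four edges and three labels
bins = [0, 1, 10, 20]
labels = ["low", "medium", "high"]

wage_bin = pd.cut(wage, bins=bins, labels=labels)

print(wage_bin)
```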

In [None]:
#11 – Coding nominal data

In [None]:
#12 – Iterating over rows of a dataframe