# Pandas Guide


### Table of Contents 
1. Understanding Pandas
    - What is Pandas? 
    - Components of Pandas
        - DataFrame
        - Series
        - Index 
    - Importing Pandas
2. Importing/Converting Data into Pandas
    - Importing Data 
        - Stata File
        - Excel 
        - SAS File
        - JSON File
        - Online File
    - Converting Data from already existing data structure 
        - List
        - Dictionary 
        - NP Array 
3. Data Analysis 
    - Understanding our data: Viewing, Subsetting, and Summary Statistics
        - Data Viewing 
        - Subsetting data
        - Summary Statistics
    - Data Manipulation
        - Creating a new variable
        - Replacing a variable
        - Renaming Variables
        - Dropping Variables
        - Simple Math to Create a new variable
        - Complex Variable Modification with .map()
        - Using a Function
        - Lambda Functions
        - Missing Values
        - Reshape Data
        - Merging Data
        - Append Data
        - Collapsing Data
4. Additional Resources
   

## 1. Understanding Pandas

### What is Pandas?


### Why Python for Data Analysis 

Python is an open-source, general-purpose programming language with built-in functions, clean and readable syntax, and an active community developing products to improve Python's abilities. Python's simple, easy-to-understand syntax makes the learning curve much less steep than that of most other statistical software packages and programming languages. Python has made great strides in its data analysis capabilities in the past decade with machine learning packages like scikit-learn, data visualization tools like Seaborn and Plotly, and text analysis packages like NLTK and Gensim. Python's multiple IDE options, like Spyder, PyCharm, or Jupyter Notebook, also give users flexibility in how they share their work with others.

### Pandas -  How to begin data analysis in Python 

While the built-in functions suffice for general programming, data analysis requires additional functions and objects. Pandas is a popular data analysis package that is simple to use the moment you start Python.

The reason Pandas is popular for data science and analysis is that it introduces three useful objects that mirror similar data structures from other statistical packages and don't add complexity to the simple Pythonic syntax. These objects are:

    1. The DataFrame
    2. The Series
    3. The Index

The rest of this chapter goes as follows: The first section of this guide will cover the objects and how they function. The second section of this guide will show how one can import and convert data into the DataFrame object. The third section will cover the steps needed to understand and manipulate your data before analysis. The fourth section covers regression analysis and interpreting the results. And the final section covers outputting results either in the form of data visualizations or summary statistics.

This guide is not meant to be comprehensive; the most comprehensive documentation for Pandas is found [here](https://pandas.pydata.org/pandas-docs/stable/pandas.pdf). Documentation for any program is long and often does not reflect the best practices a programming community has established. The purpose of this document is to inform a potential user of the functionality of Pandas and to give a general overview of how to accomplish basic data analysis tasks.

### Importing Pandas
Since Pandas is not part of Python's standard library, you will have to install it. Fortunately, Pandas has grown so much in popularity that many Python distributions already include it, but you will still have to load the package. This code will do the trick:

In [1]:
import pandas 

However, pandas is used often, so it is common practice to give it a short "nickname". The Pandas community online usually loads pandas as "pd" for ease of access:

In [2]:
import pandas as pd

It's also useful to import a few other packages, most notably numpy.

In [3]:
import numpy as np

### DataFrame

The DataFrame is the main contribution of Pandas. The DataFrame is a 2-dimensional labeled structure with columns and rows. The columns and the rows represent one dimension each. The DataFrame is analogous to the R and Stata data frame and the Excel spreadsheet. Or, put in more technical terms, the DataFrame is a tabular data structure. The code below defines a DataFrame and then prints it out to view:



In [4]:
d = pd.DataFrame({'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]})

d

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0



"d" is the name of our DataFrame and it has two columns, one and two, and four rows, 0-3. (Python is 0 indexed, meaning it counts from 0...1...2...and so on.)

Each data point can be referenced through its Index (which corresponds to its row, the leftmost value), and the column signifies what the value means. We can select a single Pandas column with this command:

In [5]:
d['one']

0    1.0
1    2.0
2    3.0
3    4.0
Name: one, dtype: float64

The values on the left represent the Index we saw earlier. Notice: a Pandas column is itself a Pandas object, the Series. More on this object below.
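
As a quick check of the structure described above, the sketch below rebuilds the same small DataFrame and inspects its shape, its column labels, and the type of a single column. The names `frame` and `col` are illustrative choices, not part of the original example.

```python
import pandas as pd

# Rebuild the two-column example DataFrame from above
frame = pd.DataFrame({'one': [1., 2., 3., 4.],
                      'two': [4., 3., 2., 1.]})

print(frame.shape)          # (number of rows, number of columns)
print(list(frame.columns))  # column labels
col = frame['one']          # selecting one column
print(type(col))            # a pandas Series
```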



### Series
A Series is a one-dimensional indexed array that can contain any Python data type. To create a Series, use the pd.Series() constructor:

In [6]:
series_ex = pd.Series([1,2,3,4])

A Series in Pandas is visually similar to a list, but there are key distinctions in how they operate. As mentioned, Pandas is used for data analysis, so a Series has functions that allow data analysis to be done easily, while a list would require either a for loop or a list comprehension for the same operations. An example of this is below:

In [7]:
series_ex = series_ex*2
print("This is a Series multiplied by two")
print(series_ex)

This is a Series multiplied by two
0    2
1    4
2    6
3    8
dtype: int64


In [8]:
my_list = [1,2,3,4]  # named my_list to avoid shadowing Python's built-in list
my_list = my_list*2
print("This is a List multiplied by two")
print(my_list)

This is a List multiplied by two
[1, 2, 3, 4, 1, 2, 3, 4]
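
To get element-wise doubling from a plain list, the list comprehension mentioned earlier is one option. The sketch below places it next to the Series version; the variable names are illustrative.

```python
import pandas as pd

values = [1, 2, 3, 4]

# A list needs an explicit loop or comprehension to double each element
doubled_list = [v * 2 for v in values]

# A Series applies the multiplication element by element
doubled_series = pd.Series(values) * 2

print(doubled_list)
print(doubled_series.tolist())
```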


### Index

Both the Series and the DataFrame have an index that signifies order and allows for referencing specific points. The Index itself is an object - though by itself it holds little purpose. 

In [9]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')
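
The Index becomes more useful once it is attached to data. As a minimal sketch, assuming made-up labels and values, the example below gives a Series a custom index and looks values up by label.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
prices = pd.Series([1.5, 2.0, 3.25], index=['apple', 'banana', 'cherry'])

print(prices.index)                     # the Index object attached to the Series
print(prices['banana'])                 # look up a value by its label
print(prices.loc[['apple', 'cherry']])  # look up several labels at once
```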

## 2. Importing Data

Pandas can read and export many data formats: CSV, DTA (Stata), SAS, JSON, SQL and many more. Almost all data reading follows the pattern pd.read_<format>('FILE_NAME'), for example pd.read_csv() or pd.read_stata(). Let's take a look at a few examples.

### Importing data

#### Stata Files (DTA)

In [10]:
# Importing data from Stata 
pd.read_stata('data/State_ETO_short.dta')

   state_abbv   ETO   ETW  ETOW    ET_cat  ET_Work_cat  ET_ET_Work_cat
0          VA   2.0  62.5  64.5    0 - 5%        30% <           30% <
1          TN   5.5  20.8  26.3   5 - 10%     10 - 30%        20 - 30%
2          VT  19.9   0.7  20.6  15 - 20%       0 - 5%        20 - 30%
3          ID   6.0  11.4  17.4   5 - 10%     10 - 30%        10 - 20%
4          OH   1.9  15.0  16.9    0 - 5%     10 - 30%        10 - 20%


This is how we read in a DataFrame, but we still need to store and name it. This can be accomplished in one line:

In [11]:
df = pd.read_stata('data/State_ETO_short.dta')

Now, when we call "df", we'll get the DataFrame that corresponds to the data read by "pd.read_stata('data/State_ETO_short.dta')":

In [12]:
df

   state_abbv   ETO   ETW  ETOW    ET_cat  ET_Work_cat  ET_ET_Work_cat
0          VA   2.0  62.5  64.5    0 - 5%        30% <           30% <
1          TN   5.5  20.8  26.3   5 - 10%     10 - 30%        20 - 30%
2          VT  19.9   0.7  20.6  15 - 20%       0 - 5%        20 - 30%
3          ID   6.0  11.4  17.4   5 - 10%     10 - 30%        10 - 20%
4          OH   1.9  15.0  16.9    0 - 5%     10 - 30%        10 - 20%


#### Excel Files (XLSX)

In [13]:
# Importing data from excel into pandas
df = pd.read_excel('data/Minimum_Wage_Data_Short.xlsx')
df

   Year       State           Table_Data  High.Value  Low.Value  CPI.Average  High 2018  Low.2018
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333       0.00      0.00
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333      15.12     15.12
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333       4.75      3.37
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333       1.12      1.12
4  1968  California              1.65(b)     1.65000    1.65000    34.783333      11.88     11.88
5  1968    Colorado       1.00 - 1.25(b)     1.25000    1.00000    34.783333       9.00      7.20


#### Other data types
Similar functions exist for reading each common data type: CSV, SAS, JSON, and so on.
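
As a rough sketch of two of those readers, the example below uses pd.read_csv() and pd.read_json(). So that it runs without any files on disk, it reads from in-memory text through io.StringIO; the sample rows are made up for illustration, and in practice you would pass a file path instead.

```python
import io
import pandas as pd

# In-memory text stands in for files on disk (made-up sample data)
csv_text = "state,wage\nAlaska,2.10\nCalifornia,1.65\n"
json_text = '[{"state": "Alaska", "wage": 2.10}, {"state": "California", "wage": 1.65}]'

from_csv = pd.read_csv(io.StringIO(csv_text))     # comma-separated text
from_json = pd.read_json(io.StringIO(json_text))  # list of JSON records

print(from_csv)
print(from_json)
```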

### Exporting Data

Exporting data is very simple as well and follows a pattern similar to importing.

#### CSV

In [14]:
df.to_csv('exports/Minimum_Wage_Data.csv')
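
By default, .to_csv() also writes the row index as an extra first column. The sketch below, which uses a small made-up DataFrame and an in-memory buffer instead of a file, shows how index=False leaves the index out so that the file reads back with the same columns.

```python
import io
import pandas as pd

# Small made-up DataFrame for illustration
small = pd.DataFrame({'State': ['Alaska', 'California'],
                      'High.Value': [2.10, 1.65]})

buffer = io.StringIO()
small.to_csv(buffer, index=False)   # index=False leaves out the row labels
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the same columns, with no extra index column
round_trip = pd.read_csv(io.StringIO(csv_text))
print(list(round_trip.columns))
```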

### Converting Data

Now that we know how to load in data, it will be useful to examine ways to convert already existing data structures into DataFrames. 

#### List

In [15]:
my_list = [1,2,3,4,5,6,7,8,9]
columns = ['a', 'b', 'c']

pd.DataFrame(np.array(my_list).reshape(3,3), columns = columns)

   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9


An important thing to note: pd.DataFrame has a multitude of options. The most important is the first argument, which is the data to be transformed. Here we transform the list [1,2,3,4,5,6,7,8,9] into an np.array with the shape of:

    [[1,2,3],
     [4,5,6],
     [7,8,9]]

Then we transform the array into a Pandas DataFrame, which gives us:

        0   1   2
    0   1   2   3
    1   4   5   6
    2   7   8   9

Finally, we add a list of column names with the option columns = columns to get the final DataFrame:

        a   b   c
    0   1   2   3
    1   4   5   6
    2   7   8   9


#### Dictionary

In [16]:
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)

   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d


#### Numpy Array

In [17]:
dtype = [('Col1','int32'), ('Col2','float32'), ('Col3','float32')]
values = np.zeros(5, dtype=dtype)
index = ['Row'+str(i) for i in range(1, len(values)+1)]

pd.DataFrame(values, index=index).head()

      Col1  Col2  Col3
Row1     0   0.0   0.0
Row2     0   0.0   0.0
Row3     0   0.0   0.0
Row4     0   0.0   0.0
Row5     0   0.0   0.0


## 3. Data Analysis

Now that we know how to load in our DataFrame, we will try to view and manipulate our data before our analysis is run. 

### Understanding our data: Viewing, Subsetting, and Summary Statistics



#### Data Viewing 

We know that we can view our data by simply printing the dataframe, but what if our data is too large?

.head() prints out the first 5 rows. (Inside the parentheses you can specify the number of observations, N, that you want to see.)

In [18]:
df.head()

   Year       State           Table_Data  High.Value  Low.Value  CPI.Average  High 2018  Low.2018
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333       0.00      0.00
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333      15.12     15.12
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333       4.75      3.37
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333       1.12      1.12
4  1968  California              1.65(b)     1.65000    1.65000    34.783333      11.88     11.88


In [19]:
df.head(2)

   Year    State Table_Data  High.Value  Low.Value  CPI.Average  High 2018  Low.2018
0  1968  Alabama        NaN         0.0        0.0    34.783333       0.00      0.00
1  1968   Alaska        2.1         2.1        2.1    34.783333      15.12     15.12


You can also use all of these methods on a single Series:

In [20]:
df['Year'].head()

0    1968
1    1968
2    1968
3    1968
4    1968
Name: Year, dtype: int64

Or, you can select multiple columns at one time:

In [21]:
df[['State', 'Table_Data']].head()

        State           Table_Data
0     Alabama                  NaN
1      Alaska                  2.1
2     Arizona  18.72 - 26.40/wk(b)
3    Arkansas          1.25/day(b)
4  California              1.65(b)


It's also good to view general information on the DataFrame with .info() and to check the data type of each column with .dtypes:

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 8 columns):
Year           6 non-null int64
State          6 non-null object
Table_Data     5 non-null object
High.Value     6 non-null float64
Low.Value      6 non-null float64
CPI.Average    6 non-null float64
High 2018      6 non-null float64
Low.2018       6 non-null float64
dtypes: float64(5), int64(1), object(2)
memory usage: 464.0+ bytes


In [23]:
df.dtypes

Year             int64
State           object
Table_Data      object
High.Value     float64
Low.Value      float64
CPI.Average    float64
High 2018      float64
Low.2018       float64
dtype: object

Similar to our use of .head(), we can use .tail() to examine the last few data points:

In [24]:
df.tail(2)

   Year       State      Table_Data  High.Value  Low.Value  CPI.Average  High 2018  Low.2018
4  1968  California         1.65(b)        1.65       1.65    34.783333      11.88     11.88
5  1968    Colorado  1.00 - 1.25(b)        1.25       1.00    34.783333       9.00      7.20


#### Subsetting data

We can do traditional slicing through the index with df[start:end]. This will subset by rows; as with any Python slice, the start position is included and the end position is excluded.

We can also subset by Series with df[['column_we_want_1', 'column_we_want_2']]. This will subset by columns.

Slicing by the index is similar to slicing any Python object by its index.

##### Slicing the data by rows

In [25]:
df[1:4]

   Year     State           Table_Data  High.Value  Low.Value  CPI.Average  High 2018  Low.2018
1  1968    Alaska                  2.1     2.10000    2.10000    34.783333      15.12     15.12
2  1968   Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333       4.75      3.37
3  1968  Arkansas          1.25/day(b)     0.15625    0.15625    34.783333       1.12      1.12


You can also store the result in a variable to reference later.

Subsetting by columns just refers to calling a specific series - as seen earlier. 

##### Slicing the data by columns

In [26]:
columns_you_want = ['State', 'Table_Data'] 
df[columns_you_want].head()

        State           Table_Data
0     Alabama                  NaN
1      Alaska                  2.1
2     Arizona  18.72 - 26.40/wk(b)
3    Arkansas          1.25/day(b)
4  California              1.65(b)


Let's bring in a larger dataset for this analysis to be meaningful. This will be a panel dataset of states, with 55 rows for every year and 2,750 rows in total.

In [27]:
df = pd.read_excel("data/Minimum_Wage_Data.xlsx")

In [28]:
print("These are the first five observations \n")
print(df[['Year', 'State']].head(5))
print("\n")

print("These are the last five observations \n")
print(df[['Year', 'State']].tail(5))

These are the first five observations 

   Year       State
0  1968     Alabama
1  1968      Alaska
2  1968     Arizona
3  1968    Arkansas
4  1968  California


These are the last five observations 

      Year          State
2745  2017       Virginia
2746  2017     Washington
2747  2017  West Virginia
2748  2017      Wisconsin
2749  2017        Wyoming


Viewing the size and columns of the data:

In [29]:
print(len(df))
print(df.columns)

2750
Index(['Year', 'State', 'Table_Data', 'High.Value', 'Low.Value', 'CPI.Average',
       'High 2018', 'Low.2018'],
      dtype='object')


We have 8 variables: Year, State, Table_Data, High.Value, Low.Value, CPI.Average, High 2018, and Low.2018. Sometimes in our analysis we only want to keep certain years. For this, we combine boolean conditions with Pandas indexing to keep only the rows we want.

#### Slicing based on conditions

In [30]:
# Rows past 2015
print(df[df.Year > 2015].head(3))

      Year    State Table_Data  High.Value  Low.Value  CPI.Average  High 2018  \
2640  2016  Alabama        ...        0.00       0.00   240.007167       0.00   
2641  2016   Alaska       9.75        9.75       9.75   240.007167      10.17   
2642  2016  Arizona       8.05        8.05       8.05   240.007167       8.40   

      Low.2018  
2640      0.00  
2641     10.17  
2642      8.40  


Segmenting on multiple conditions is also desirable. Most programs allow for an "AND" operator and an "OR" operator; in Pandas these are & and |, and each condition must be wrapped in parentheses.

In [31]:
# California AND 2010
print(df[(df.Year == 2010) & (df.State == 'California')].head(3))
print('\n')

      Year       State Table_Data  High.Value  Low.Value  CPI.Average  \
2314  2010  California          8         8.0        8.0     218.0555   

      High 2018  Low.2018  
2314       9.19      9.19  




In [32]:
# Alabama OR before 2015
print(df[(df.State == "Alabama") | (df.Year < 2015)].head(3))

   Year    State           Table_Data  High.Value  Low.Value  CPI.Average  \
0  1968  Alabama                  NaN        0.00      0.000    34.783333   
1  1968   Alaska                  2.1        2.10      2.100    34.783333   
2  1968  Arizona  18.72 - 26.40/wk(b)        0.66      0.468    34.783333   

   High 2018  Low.2018  
0       0.00      0.00  
1      15.12     15.12  
2       4.75      3.37  
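
The sketch below gathers the condition operators in one place and adds two related tools, ~ for "NOT" and .isin() for matching against a list of values. It runs on a small made-up DataFrame that borrows the Year and State column names from the data above.

```python
import pandas as pd

# Small made-up sample with the same column names as the guide's data
sample = pd.DataFrame({
    'Year': [2014, 2015, 2016, 2016],
    'State': ['Alabama', 'Alaska', 'Alabama', 'Texas'],
})

# AND: both conditions must hold
both = sample[(sample['Year'] == 2016) & (sample['State'] == 'Alabama')]

# OR: either condition may hold
either = sample[(sample['State'] == 'Texas') | (sample['Year'] < 2015)]

# NOT: ~ inverts a condition
not_alabama = sample[~(sample['State'] == 'Alabama')]

# isin: shorthand for several OR-ed equality tests
selected = sample[sample['State'].isin(['Alaska', 'Texas'])]

print(both, either, not_alabama, selected, sep='\n\n')
```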


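The traditional Python index slicers are also applicable to DataFrames through the .iloc[] and .loc[] indexers: .iloc[] selects rows by integer position, while .loc[] selects rows by index label.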

In [33]:
print(df.iloc[99])
print('\n')
print(df.iloc[[1, 50, 300]])

Year                       1969
State              South Dakota
Table_Data     17.00 - 20.00/wk
High.Value                  0.5
Low.Value                 0.425
CPI.Average             36.6833
High 2018                  3.41
Low.2018                    2.9
Name: 99, dtype: object


     Year      State  Table_Data  High.Value  Low.Value  CPI.Average  \
1    1968     Alaska         2.1         2.1       2.10    34.783333   
50   1968   Virginia         ...         0.0       0.00    34.783333   
300  1973  Minnesota  .75 - 1.60         1.6       0.75    44.400000   

     High 2018  Low.2018  
1        15.12     15.12  
50        0.00      0.00  
300       9.02      4.23  


In [34]:
print(df.loc[100])
print('\n')
print(df.loc[[2, 51, 301]])

Year                1969
State          Tennessee
Table_Data           ...
High.Value             0
Low.Value              0
CPI.Average      36.6833
High 2018              0
Low.2018               0
Name: 100, dtype: object


     Year        State           Table_Data  High.Value  Low.Value  \
2    1968      Arizona  18.72 - 26.40/wk(b)        0.66      0.468   
51   1968   Washington                  1.6        1.60      1.600   
301  1973  Mississippi                  ...        0.00      0.000   

     CPI.Average  High 2018  Low.2018  
2      34.783333       4.75      3.37  
51     34.783333      11.52     11.52  
301    44.400000       0.00      0.00  
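
With the default 0, 1, 2, ... index, positions and labels coincide, so the difference between the two indexers is easy to miss. The sketch below uses a made-up DataFrame whose index starts at 5 to make the distinction visible.

```python
import pandas as pd

# Made-up data with an index that does not start at 0
scores = pd.DataFrame({'score': [10, 20, 30]}, index=[5, 6, 7])

by_position = scores.iloc[0]   # first row, regardless of its label
by_label = scores.loc[5]       # row whose index label is 5

print(by_position['score'], by_label['score'])

# Slices differ too: iloc excludes the end position, loc includes the end label
print(len(scores.iloc[0:2]))   # positions 0 and 1
print(len(scores.loc[5:7]))    # labels 5, 6 and 7
```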


#### Summary Statistics
Oftentimes, exporting summary statistics to a different document type is the best way to visualize the results and understand the data.

In [35]:
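df.describe()                      # summary statistics for the numeric columns
print(df.describe())               # same table, printed explicitly
np.round(df.describe(), 2)         # rounded to two decimal places
np.round(df.describe(), 2).T       # transposed: one row per variable
df.describe().transpose().to_csv('summary_stats.csv', sep=',')  # export to CSV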

              Year   High.Value    Low.Value  CPI.Average    High 2018  \
count  2750.000000  2739.000000  2739.000000  2750.000000  2739.000000   
mean   1992.500000     3.653761     3.533555   138.828983     6.441486   
std      14.433494     2.560308     2.539424    65.823807     3.025140   
min    1968.000000     0.000000     0.000000    34.783333     0.000000   
25%    1980.000000     1.600000     1.600000    82.408333     5.980000   
50%    1992.500000     3.350000     3.350000   142.387500     7.370000   
75%    2005.000000     5.150000     5.150000   195.291667     8.280000   
max    2017.000000    11.500000    11.500000   245.119583    15.120000   

          Low.2018  
count  2739.000000  
mean      6.200252  
std       3.017818  
min       0.000000  
25%       5.230000  
50%       7.170000  
75%       8.070000  
max      15.120000  


### Data Manipulation

The standard way to create a new Pandas column or replace an old one is to reference the Series (whether it exists or not) on the left-hand side and set it equal to an expression for the values you want. For example:



#### Creating a new variable

In [36]:
df['7'] = 7
print(df.head())

   Year       State           Table_Data  High.Value  Low.Value  CPI.Average  \
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333   
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333   
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333   
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333   
4  1968  California              1.65(b)     1.65000    1.65000    34.783333   

   High 2018  Low.2018  7  
0       0.00      0.00  7  
1      15.12     15.12  7  
2       4.75      3.37  7  
3       1.12      1.12  7  
4      11.88     11.88  7  


Or, we can replace an old variable in a similar way:

#### Replacing a variable

In [37]:
df['7'] = 8
print(df.head())

   Year       State           Table_Data  High.Value  Low.Value  CPI.Average  \
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333   
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333   
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333   
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333   
4  1968  California              1.65(b)     1.65000    1.65000    34.783333   

   High 2018  Low.2018  7  
0       0.00      0.00  8  
1      15.12     15.12  8  
2       4.75      3.37  8  
3       1.12      1.12  8  
4      11.88     11.88  8  


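We can also rename variables. In this case we will rename "Year" to "Date" and "Table_Data" to "Values". Note that .rename() returns a new DataFrame; df itself is unchanged unless the result is assigned back.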

#### Renaming Variables

In [38]:
print(df.head())
print("\n")
print(df.rename(columns={"Year": "Date", "Table_Data": "Values"}).head())

   Year       State           Table_Data  High.Value  Low.Value  CPI.Average  \
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333   
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333   
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333   
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333   
4  1968  California              1.65(b)     1.65000    1.65000    34.783333   

   High 2018  Low.2018  7  
0       0.00      0.00  8  
1      15.12     15.12  8  
2       4.75      3.37  8  
3       1.12      1.12  8  
4      11.88     11.88  8  


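   Date       State               Values  High.Value  Low.Value  CPI.Average  \
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333   
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333   
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333   
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333   
4  1968  California              1.65(b)     1.65000    1.65000    34.783333   

   High 2018  Low.2018  7  
0       0.00      0.00  8  
1      15.12     15.12  8  
2       4.75      3.37  8  
3       1.12      1.12  8  
4      11.88     11.88  8  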

#### Dropping Variables
We can also drop a variable with df.drop(): 

In [39]:
# Dropping a variable
df = df.drop("7", axis=1)
print(df.head())

   Year       State           Table_Data  High.Value  Low.Value  CPI.Average  \
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333   
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333   
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333   
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333   
4  1968  California              1.65(b)     1.65000    1.65000    34.783333   

   High 2018  Low.2018  
0       0.00      0.00  
1      15.12     15.12  
2       4.75      3.37  
3       1.12      1.12  
4      11.88     11.88  


#### Simple Math to Create a New Variable
Basic math functions are easily applied in Pandas:

In [40]:
df['Difference'] = df['High.Value'] - df['Low.Value']
df['High*2'] = df['High.Value']*2
df['Low*2'] = df['Low.Value']*2
print(df.head())

   Year       State           Table_Data  High.Value  Low.Value  CPI.Average  \
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333   
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333   
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333   
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333   
4  1968  California              1.65(b)     1.65000    1.65000    34.783333   

   High 2018  Low.2018  Difference  High*2   Low*2  
0       0.00      0.00       0.000  0.0000  0.0000  
1      15.12     15.12       0.000  4.2000  4.2000  
2       4.75      3.37       0.192  1.3200  0.9360  
3       1.12      1.12       0.000  0.3125  0.3125  
4      11.88     11.88       0.000  3.3000  3.3000  


#### Complex Variable Modification with .map()

More complex operations can be handled with the .map() method, which transforms each value of a Series.

.map() can take a dictionary that maps old values to new ones, which helps keep code clean when replacing a large number of values. Any value without a matching key becomes NaN. Here we create an abbrev column for state abbreviations (we will create it and drop the variable after).

In [41]:
# Using Data.map
state_2 = {'Ohio': 'OH', 'Illinois': 'IL', 'California': 'CA', 'Texas': 'TX'}

df['abbrev'] = df['State'].map(state_2)

In [42]:
df.sort_values('abbrev', ascending=True).head()
df = df.drop("abbrev", axis=1)
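
Because the dictionary above covers only a few states, most rows of abbrev end up as NaN. The sketch below shows that behavior on a small made-up Series, along with one possible fallback that keeps the original value when no match exists. The names `states` and `lookup` are illustrative.

```python
import pandas as pd

states = pd.Series(['Ohio', 'Texas', 'Alaska'])
lookup = {'Ohio': 'OH', 'Texas': 'TX'}   # 'Alaska' is deliberately missing

abbrev = states.map(lookup)
print(abbrev)          # the unmatched state becomes NaN

# One possible fallback: keep the original value when no match exists
abbrev_filled = abbrev.fillna(states)
print(abbrev_filled.tolist())
```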

#### Using a Function

You can use .apply() to apply a function to a Series. You can either specify a named function or use an anonymous lambda function:

Specifying a function:


In [43]:
def add(x):
    x = x + 1
    return x

print(df['Year'].head())
print("\n")
print(df['Year'].apply(add).head())

0    1968
1    1968
2    1968
3    1968
4    1968
Name: Year, dtype: int64


0    1969
1    1969
2    1969
3    1969
4    1969
Name: Year, dtype: int64


#### Lambda Functions
Lambda functions let you skip defining a named function when the function is simple enough:

In [44]:
print(df['Year'].head())
print("\n")
print(df['Year'].apply(lambda x: x + 1).head())

0    1968
1    1968
2    1968
3    1968
4    1968
Name: Year, dtype: int64


0    1969
1    1969
2    1969
3    1969
4    1969
Name: Year, dtype: int64
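
For simple arithmetic like this, .apply() is not strictly needed: the same result can be obtained by operating on the whole Series at once, as in the earlier "Simple Math" examples. The sketch below compares the two approaches on a short made-up Series.

```python
import pandas as pd

years = pd.Series([1968, 1969, 1970], name='Year')

with_apply = years.apply(lambda x: x + 1)   # calls the function once per value
vectorized = years + 1                      # operates on the whole Series at once

print(with_apply.equals(vectorized))
```

.apply() remains useful when the transformation cannot be written as a direct operation on the Series.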


#### Missing Values

Missing values are a huge part of data cleaning and something to always be mindful of. Pandas codes missing values as NumPy NaN values. Let's take a look:



In [45]:
print(df['Table_Data'].head(10))
print("\n")
print(df['Year'].head(10))

0                    NaN
1                    2.1
2    18.72 - 26.40/wk(b)
3            1.25/day(b)
4                1.65(b)
5         1.00 - 1.25(b)
6                    1.4
7                   1.25
8            1.25 - 1.40
9          $1.15 & $1.60
Name: Table_Data, dtype: object


0    1968
1    1968
2    1968
3    1968
4    1968
5    1968
6    1968
7    1968
8    1968
9    1968
Name: Year, dtype: int64


Discovering missing values is important:

In [46]:
print(df['Table_Data'].isnull().values.any())
print(df['Year'].isnull().values.any())

True
False
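
Once missing values have been found, common next steps are to count them, drop them, or fill them in. The sketch below illustrates .isna().sum(), .dropna(), and .fillna() on a small made-up DataFrame. Filling with 0 is only an illustrative choice; the right treatment depends on the analysis.

```python
import numpy as np
import pandas as pd

# Made-up sample containing one missing value
sample = pd.DataFrame({'State': ['Alabama', 'Alaska', 'Arizona'],
                       'High.Value': [np.nan, 2.10, 0.66]})

missing_counts = sample.isna().sum()        # number of missing values per column
dropped = sample.dropna()                   # drop rows with any missing value
filled = sample.fillna({'High.Value': 0})   # replace missing values with 0

print(missing_counts)
print(len(dropped))
print(filled['High.Value'].tolist())
```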


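We can also use .drop_duplicates() to drop duplicated observations. With keep="first" (the default), all but the first occurrence is dropped; the example below uses keep="last", which keeps only the last occurrence.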



In [47]:
data = {'col_1': [3, 3, 3, 3], 'col_2': ['a', 'a', 'a', 'a']}
data_dup = pd.DataFrame.from_dict(data)
print(data_dup)

print("\n")

data_dup = data_dup.drop_duplicates(keep="last")

print(data_dup)

   col_1 col_2
0      3     a
1      3     a
2      3     a
3      3     a


   col_1 col_2
3      3     a


#### Reshape Data

Data can be reshaped using a variety of Pandas functions. Going from wide to long has a built-in function, pd.wide_to_long(). We will create a small example dataset for this:

In [48]:
df_ex = pd.DataFrame({
    'Unique Family Identifier': [1000, 1000, 1000, 1001, 1001, 1001, 1002, 324, 234],
    'Order': [1, 2, 3, 1, 2, 3, 1, 2, 3], 
    'az1': [28, 82, 23, 234, 234, 324, 546, 546, 5464],
    'az2': [2342, 2342, 54645, 56765, 65765, 65756, 3453, 56756, 3453]})


reshape = pd.wide_to_long(df_ex, stubnames='az', i=['Unique Family Identifier', 'Order'], j='age')

print(df.head())
print("\n")
print(reshape)

   Year       State           Table_Data  High.Value  Low.Value  CPI.Average  \
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333   
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333   
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333   
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333   
4  1968  California              1.65(b)     1.65000    1.65000    34.783333   

   High 2018  Low.2018  Difference  High*2   Low*2  
0       0.00      0.00       0.000  0.0000  0.0000  
1      15.12     15.12       0.000  4.2000  4.2000  
2       4.75      3.37       0.192  1.3200  0.9360  
3       1.12      1.12       0.000  0.3125  0.3125  
4      11.88     11.88       0.000  3.3000  3.3000  


                                       az
Unique Family Identifier Order age       
1000                     1     1       28
                               2     2342
                         2     

Going from long to wide requires the use of unstack():

In [49]:
normal = reshape.unstack()
normal.columns = normal.columns.map('{0[0]}{0[1]}'.format)
normal.reset_index()
print(normal.head())

                                 az1    az2
Unique Family Identifier Order             
234                      3      5464   3453
324                      2       546  56756
1000                     1        28   2342
                         2        82   2342
                         3        23  54645


#### Merging Data

Merging data uses pd.merge(data_1, data_2, on = identifier):

In [50]:
left_frame = pd.DataFrame({'key': range(5), 
                           'left_value': ['a', 'b', 'c', 'd', 'e']})
right_frame = pd.DataFrame({'key': range(2, 7), 
                           'right_value': ['f', 'g', 'h', 'i', 'j']})
print(left_frame)
print('\n')
print(right_frame)

   key left_value
0    0          a
1    1          b
2    2          c
3    3          d
4    4          e


   key right_value
0    2           f
1    3           g
2    4           h
3    5           i
4    6           j


In [51]:
pd.merge(left_frame, right_frame, on='key', how='inner')

   key left_value right_value
0    2          c           f
1    3          d           g
2    4          e           h


In [52]:
pd.merge(left_frame, right_frame, on='key', how='left')

   key left_value right_value
0    0          a         NaN
1    1          b         NaN
2    2          c           f
3    3          d           g
4    4          e           h


In [53]:
pd.merge(left_frame, right_frame, on='key', how='right')

   key left_value right_value
0    2          c           f
1    3          d           g
2    4          e           h
3    5        NaN           i
4    6        NaN           j
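
The three merges above differ in which keys they keep: how='inner' keeps keys found in both frames, how='left' keeps every key from the left frame, and how='right' keeps every key from the right frame. As a further sketch, how='outer' keeps every key from either frame, and the indicator=True option adds a _merge column recording where each row came from.

```python
import pandas as pd

left_frame = pd.DataFrame({'key': range(5),
                           'left_value': ['a', 'b', 'c', 'd', 'e']})
right_frame = pd.DataFrame({'key': range(2, 7),
                            'right_value': ['f', 'g', 'h', 'i', 'j']})

# how='outer' keeps every key from both frames;
# indicator=True adds a _merge column: left_only, right_only, or both
outer = pd.merge(left_frame, right_frame, on='key', how='outer', indicator=True)
print(outer)
```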


#### Append Data

In [54]:
pd.concat([left_frame, right_frame], sort=True)

   key left_value right_value
0    0          a         NaN
1    1          b         NaN
2    2          c         NaN
3    3          d         NaN
4    4          e         NaN
0    2        NaN           f
1    3        NaN           g
2    4        NaN           h
3    5        NaN           i
4    6        NaN           j
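
Notice that the appended result keeps each frame's original row labels, so the labels 0 through 4 appear twice. If unique row labels are wanted, one option is ignore_index=True, sketched below on two small made-up frames.

```python
import pandas as pd

# Two small made-up frames with the same columns
top = pd.DataFrame({'key': [0, 1], 'value': ['a', 'b']})
bottom = pd.DataFrame({'key': [2, 3], 'value': ['c', 'd']})

stacked = pd.concat([top, bottom])                         # keeps the original row labels
renumbered = pd.concat([top, bottom], ignore_index=True)   # assigns fresh row labels

print(stacked.index.tolist())
print(renumbered.index.tolist())
```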


#### Collapsing Data

Collapsing data is accomplished with the .groupby() method. We will collapse by Year.

In [55]:
by_year = df.groupby('Year')
print(by_year.count().head()) # NOT NULL records within each column

      State  Table_Data  High.Value  Low.Value  CPI.Average  High 2018  \
Year                                                                     
1968     55          53          54         54           55         54   
1969     55          54          54         54           55         54   
1970     55          54          54         54           55         54   
1971     55          54          54         54           55         54   
1972     55          54          54         54           55         54   

      Low.2018  Difference  High*2  Low*2  
Year                                       
1968        54          54      54     54  
1969        54          54      54     54  
1970        54          54      54     54  
1971        54          54      54     54  
1972        54          54      54     54  


Note: This only generates counts for each year. If we want averages, sums or any other summary statistics we have to specify that: 

In [56]:
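# numeric_only=True restricts each calculation to numeric columns such as High.Value
print(by_year.sum(numeric_only=True)[20:25])
print('\n')
print(by_year.mean(numeric_only=True)[20:25])
print('\n')
print(by_year.median(numeric_only=True)[20:25])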

      High.Value  Low.Value  CPI.Average  High 2018  Low.2018  Difference  \
Year                                                                        
1988      146.75     142.75  6504.208332     310.65    302.19         4.0   
1989      146.75     142.75  6818.166669     296.49    288.41         4.0   
1990      146.75     142.75  7186.208331     281.27    273.61         4.0   
1991      179.20     175.00  7490.541668     329.52    321.80         4.2   
1992      194.17     189.57  7717.416668     346.40    338.19         4.6   

      High*2   Low*2  
Year                  
1988  293.50  285.50  
1989  293.50  285.50  
1990  293.50  285.50  
1991  358.40  350.00  
1992  388.34  379.14  


      High.Value  Low.Value  CPI.Average  High 2018  Low.2018  Difference  \
Year                                                                        
1988    2.668182   2.595455   118.258333   5.648182  5.494364    0.072727   
1989    2.668182   2.595455   123.966667   5.390727  5.243818    0
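
Several statistics can also be computed in a single step with .agg(). The sketch below collapses a small made-up panel, which borrows the Year, State, and High.Value column names from the data above, to one row per year.

```python
import pandas as pd

# Made-up panel data with the same column names as the guide's dataset
panel = pd.DataFrame({
    'Year': [1988, 1988, 1989, 1989],
    'State': ['Alaska', 'Texas', 'Alaska', 'Texas'],
    'High.Value': [3.85, 3.35, 3.85, 3.35],
})

# Collapse to one row per year, computing several statistics for one column
summary = panel.groupby('Year')['High.Value'].agg(['mean', 'min', 'max'])
print(summary)
```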

## Additional Resources
1. https://plot.ly/python/
2. https://nbviewer.jupyter.org/github/QuantEcon/QuantEcon.notebooks/blob/master/pandas_and_matplotlib.ipynb
3. https://cheatsheets.quantecon.org/stats-cheatsheet.html
4. https://nbviewer.jupyter.org/github/QuantEcon/QuantEcon.notebooks/blob/master/sci_python_quickstart.ipynb
5. https://cheatsheets.quantecon.org/
6. https://seaborn.pydata.org/ 
7. https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/ 
8. https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/
9. https://paulovasconcellos.com.br/28-useful-pandas-functions-you-might-not-know-de42c59db085
10. https://realpython.com/python-pandas-tricks/
11. https://dataconomy.com/2015/03/14-best-python-pandas-features/