# The Pandas Library

Pandas is an open source library that provides easy-to-use data structures and data analysis tools in Python. The library offers two useful labelled data structures: the Series and the Dataframe.

Ref: http://pandas.pydata.org

## Series

A series is a one-dimensional labelled array: each element has a label, and the labels together form the index. A series can hold any type of data, such as integers, floats, strings, or Python objects, and it can be built from a scalar value, a list, an ndarray, or a dict.

The syntax for creating a series is

```python
s = pd.Series(data, index=index)
```

An example of how to use a series in a program is shown below.

```python
import pandas as pd
import numpy as np

# Avoid naming the variable `list`, which would shadow the Python built-in
numbers = pd.Series([2, 3, 4, 5, 6, 7])
numbers
```

Output:

<pre>
0    2
1    3
2    4
3    5
4    6
5    7
dtype: int64
</pre>

An element can be accessed using its index. Elements can also be selected by value, by filtering the series with a condition.
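As a minimal sketch of element access, the example below builds a small series with string labels and reads from it in three ways. The series name `prices` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: three prices labelled by fruit name
prices = pd.Series([1.5, 0.25, 3.0], index=['apple', 'banana', 'cherry'])

by_label = prices['banana']        # look up by index label
by_position = prices.iloc[1]       # look up by integer position (0-based)
expensive = prices[prices > 1.0]   # keep only elements whose value exceeds 1.0

print(by_label, by_position)
print(expensive)
```

When no index is given, as in the example above, the labels default to 0, 1, 2, and so on, so label and position coincide.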

## Exercise

Assign apple, ball, cat, dog, and elephant to a series named 'alphabet', and then access the element 'ball' using its index.
    

```python
import pandas as pd
import numpy as np
```


### Solution

```python
alphabet = pd.Series(['apple', 'ball', 'cat', 'dog', 'elephant'])
alphabet[1]
```

## Dataframes

According to pydata.org, a dataframe is a 2-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet, a SQL table, or a dict of Series objects.

<img src="https://s3.amazonaws.com/rfv2/dataframe.png" style="width: 300px;">

To use the pandas library, first import it; you can then use the functions it provides.

```python
import pandas as pd
import numpy as np
pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```
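Since a dataframe can be thought of as a dict of Series objects, it can also be built that way. The sketch below assumes two small, made-up series (`names` and `ages`); each dict key becomes a column name.

```python
import pandas as pd

# Hypothetical columns, each held in its own Series
names = pd.Series(['Ann', 'Bob', 'Cy'])
ages = pd.Series([31, 25, 47])

# The dict keys become the column labels; the Series share the default index 0..2
people = pd.DataFrame({'name': names, 'age': ages})

print(people)
print(people.shape)   # (number of rows, number of columns)
```

Unlike the NumPy array example, the columns here keep their own names and their own data types.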

### Description

Let us read a csv file into a dataframe object. Pandas is versatile enough to read files directly from URL links.

Dataframes are as important to a data scientist as cutting is to a cook. Learning the various dataframe operations is essential to becoming a strong data scientist.

To understand how a dataframe works, let us consider the datasets used by FiveThirtyEight.com to analyze the earning economics of college majors and to conduct a survey of Midwest states.


### College Majors datasets

Article ref: The Economic Guide To Picking A College Major: 
https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/

"Major categories are from Carnevale et al: "What's It Worth?: The Economic Value of College Majors." Georgetown University Center on Education and the Workforce, 2011. http://cew.georgetown.edu/whatsitworth"

Data is from American Community Survey 2010-2012 Public Use Microdata Series from:

http://www.census.gov/acs/www/data_documentation/pums_data/

CSV datasets can be found at: https://github.com/colaberry/538data/tree/master/college-majors

### Midwest Survey datasets

Example context:

"Here’s a somewhat regular argument I get in: Which states make up which regions of the United States? Some of these regions — the West Coast, Mountain States, Southwest and Northeast are pretty clearly defined — but two other regions, the South and the Midwest, are more nebulous.

I’m from New York, and I generally consider anything west of Philadelphia the Midwest. This admittedly unsophisticated designation is frequently criticized by self-avowed Midwesterners. My boss, originally of Michigan, has many opinions about what, precisely, falls into the Midwest. So I decided to find out which states Midwesterners consider to be in their territory.

To get this broad-based view, we asked SurveyMonkey Audience to ask self-identified Midwesterners which states make the cut. We ran a national survey that targeted the Midwest from March 12 to March 17, with 2,778 respondents. Of those, 1,357 respondents identified “a lot” or “some” as a Midwesterner. We then asked this group to identify the states they consider part of the Midwest."

Article ref: Which States Are in the Midwest? https://fivethirtyeight.com/datalab/what-states-are-in-the-midwest/

SOUTH.csv and MIDWEST.csv contain individual responses from surveys about regional identification conducted for FiveThirtyEight by SurveyMonkey. 

CSV datasets can be found at: https://github.com/colaberry/538data/tree/master/region-survey

## Exercise:

### Reading into a Dataframe

Dataframes are a part of the pandas library. One of the versatile features of pandas is that it can read data from a URL when the link is passed to the read_csv() function.

 - One of the most useful and fundamental commands is the 'head' function, which lists the first few rows of the dataframe. It takes the number of rows to list from the top (default is 5).
 - List the first five rows of the grad_students dataframe using the 'head' function and assign the result to the grad_sample variable.

```python
# Import the library
import pandas as pd
import numpy as np

grad_url = "https://raw.githubusercontent.com/colaberry/538data/master/college-majors/grad-students.csv"
grad_students = pd.read_csv(grad_url)

# Use .head() function
# write your code here
```


### Solution

```python
grad_sample = grad_students.head()
grad_sample
```

## Using head() function with a parameter

We can pass the number of rows as a parameter to head() function.

### Exercise:

- Now, try listing first 7 rows of the midwest_survey dataframe using the head() function and assign it to midwest_sample variable.

```python
# Import the library
import pandas as pd
import numpy as np

midwest_survey_url = "https://raw.githubusercontent.com/colaberry/538data/master/region-survey/MIDWEST.csv"
midwest_survey = pd.read_csv(midwest_survey_url)

# Use value in .head() function by passing the number of rows.
# write your code below
```


### Solution

```python
midwest_sample = midwest_survey.head(7)
midwest_sample
```

### Writing a Dataframe into a csv file

To save a dataframe to a csv file, use 'to_csv':

```python
df.to_csv('filename.csv')
```

where df is the name of the dataframe and filename.csv is the name you want to give your csv file.

to_csv accepts many arguments. For example, 'sep' sets the delimiter placed between fields:

```python
df.to_csv(file_name, sep=',')
```

By default, the row index of the dataframe is written to the file as the first column. To keep the index out of the csv file, pass False to the 'index' argument:

```python
df.to_csv(file_name, encoding='utf-8', index=False)
```
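To see what 'index=False' changes without touching the disk, the sketch below relies on the fact that to_csv returns the CSV text as a string when no file name is given. The dataframe `scores` is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical data for the round trip
scores = pd.DataFrame({'name': ['Ann', 'Bob'], 'score': [90, 85]})

with_index = scores.to_csv()                # first column holds the row index
without_index = scores.to_csv(index=False)  # only the data columns are written

print(with_index)
print(without_index)

# Read the text back into a dataframe to confirm nothing was lost
restored = pd.read_csv(io.StringIO(without_index))
print(restored)
```

The header line of the first output starts with an empty field, which is the unnamed index column; the second output has only the 'name' and 'score' columns.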

## Exercise

What is the syntax for writing a dataframe 'grad_students' to a csv file 'gradstudents.csv' without any index values stored in the csv file?


```python
import pandas as pd
import numpy as np
grad_url = "https://raw.githubusercontent.com/colaberry/538data/master/college-majors/grad-students.csv"
grad_students = pd.read_csv(grad_url)
```



### Solution

```python
student_doc = "gradstudents.csv"
grad_students.to_csv(student_doc, encoding='utf-8', index=False)

```

## Categorical & Dummy Variables


<img src="../../../images/types_of_data.png" style="width: 700px;">

Categorical variables are variables that can take a value from a fixed set. For example, the day of the week:

```python
day = {'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'}
```
Frequently, data scientists encounter columns that hold categorical variables. For example, consider a dataframe that contains data about various purchases, tabulated by day and by the sex of the person who made the purchase. We can create the dataframe from a dictionary of values:

```python
raw_data = {'Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
                    'Sunday', 'Monday', 'Tuesday', 'Wednesday'],
            'Sex': ['male', 'female', 'male', 'female', 'female', 'female', 'female',
                    'male', 'male', 'female'],
            'Total_Amount': ['100.0', '30', '40', '70', '90', '20', '50', '30',
                             '60', '70']}

checks = pd.DataFrame(raw_data, columns=['Total_Amount', 'Day', 'Sex'])
```
The checks dataframe now contains:

<pre>
  Total_Amount        Day     Sex
0        100.0     Monday    male
1           30    Tuesday  female
2           40  Wednesday    male
3           70   Thursday  female
4           90     Friday  female
5           20   Saturday  female
6           50     Sunday  female
7           30     Monday    male
8           60    Tuesday    male
9           70  Wednesday  female
</pre>

Certain machine learning algorithms cannot make sense of categorical variables such as Day or Sex. Hence, we need to transform them into numeric codes that represent each unique value. To do so, we can use dummy variables. Dummy variables transform the day of the week into a set of 0/1 indicator columns:

```python
# dtype=int keeps the output as 0/1; recent pandas versions return True/False by default
day_dummies = pd.get_dummies(checks['Day'], dtype=int)
day_dummies
```
<pre>
   Friday  Monday  Saturday  Sunday  Thursday  Tuesday  Wednesday
0       0       1         0       0         0        0          0
1       0       0         0       0         0        1          0
2       0       0         0       0         0        0          1
3       0       0         0       0         1        0          0
4       1       0         0       0         0        0          0
5       0       0         1       0         0        0          0
6       0       0         0       1         0        0          0
7       0       1         0       0         0        0          0
8       0       0         0       0         0        1          0
9       0       0         0       0         0        0          1
</pre>

You can see that there are as many columns as there are unique values in the categorical variable. Each value, such as Monday or Tuesday, also has its own unique pattern of 1s and 0s.
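get_dummies can also be applied to a whole dataframe, in which case the 'columns' argument names the columns to encode and the other columns are left untouched. The sketch below uses a made-up `orders` dataframe; the new column names are formed from the original column name and each unique value.

```python
import pandas as pd

# Hypothetical data: 'Size' is categorical, 'Qty' is numeric
orders = pd.DataFrame({'Size': ['S', 'M', 'S', 'L'], 'Qty': [1, 2, 3, 4]})

# Encode only the 'Size' column; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(orders, columns=['Size'], dtype=int)

print(encoded)
```

Each row has exactly one 1 among the Size_L, Size_M, and Size_S columns, matching the size recorded for that order.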

<br/>

### Exercise:

 - Transform the 'Sex' column into dummy variables and append them to the dataframe, checks.

```python
raw_data = {'Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday','Sunday', 'Monday', 'Tuesday', 'Wednesday'],
            'Sex': ['male', 'female', 'male', 'female', 'female', 'female', 'female', 'male', 'male', 'female'],
            'Total_Amount': ['100.0', '30', '40', '70', '90', '20', '50', '30','60', '70']}
checks = pd.DataFrame(raw_data, columns = ['Total_Amount', 'Day', 'Sex'])

# Use pd.concat([<dataframe>, <sex dataframe>], axis=1) and display the dataframe.
# Add code to create new dummy variables out of the Sex column and append them to the checks dataframe.
```



### Solution

```python
sex_dummies = pd.get_dummies(checks['Sex'])
checks = pd.concat((checks, sex_dummies), axis=1)
```

# Indexing 

Indexing selects data by label, such as a column name, or by numeric range. To select a single column, pass its name:

```python
    grad_students['Major']
```

We can also pass a list of column names to select those columns from the dataset.

```python
    grad_students[['Major', 'Grad_total']]
```


# Slicing

Slicing is done to pull out a set of rows from the dataset.

```python
    grad_students[2:5]
```

To slice out a set of rows, use the syntax data[start:stop]. The row at position stop is not included.

We can select specific ranges of our data in both the row and column directions using either label-based or integer-based indexing:

 - loc: indexing via labels (which may be integers when the index labels are integers); a label slice includes both end points
 - iloc: indexing via integer positions; the stop position is excluded
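The difference between loc and iloc is easiest to see on a dataframe whose row labels are not numbers. The sketch below uses a made-up dataframe with row labels 'a' to 'd'; both selections return the same two rows.

```python
import pandas as pd

# Hypothetical data with string row labels
df = pd.DataFrame(
    {'Major': ['Math', 'Art', 'Physics', 'History'],
     'Grad_total': [120, 80, 95, 60]},
    index=['a', 'b', 'c', 'd'],
)

by_label = df.loc['b':'c', 'Major']   # label slice: both 'b' and 'c' are included
by_position = df.iloc[1:3, 0]         # position slice: position 3 is excluded

print(by_label)
print(by_position)
```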

### Slicing Data based on Criteria
We can also select a subset of our data using criteria. For example, we can select all rows that have a Grad_total value of at most 10000.

```python
    grad_students[grad_students.Grad_total <= 10000]
```
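Selection by criteria works in two steps: the comparison produces a Series of True/False values (a mask), and indexing with that mask keeps only the rows marked True. The sketch below makes the intermediate mask visible on a small made-up dataframe.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'Major': ['Math', 'Art', 'Physics'],
                   'Grad_total': [12000, 8000, 9500]})

mask = df['Grad_total'] <= 10000   # one True/False value per row
small = df[mask]                   # keeps only the rows where the mask is True

print(mask)
print(small)
```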

## Exercise

Select the rows of the data where Grad_unemployed is greater than or equal to 700, store the result in a variable 'unempl_num', and print it.


```python
import pandas as pd
import numpy as np
grad_url = "https://raw.githubusercontent.com/colaberry/538data/master/college-majors/grad-students.csv"
grad_students = pd.read_csv(grad_url)
```



### Solution


```python
unempl_num = grad_students[grad_students.Grad_unemployed >= 700]
unempl_num
```



<br/><br/><br/>
## Row Selection

The output of the head() command shows the row indices. Dataframes are indexed starting from 0 on both rows and columns, similar to array indexing. Note that the upper bound of a slice is excluded from the output. To list rows 2 to 6 of the column 'Major' (positions 1 to 5):

```python
grad_students['Major'][1:6]
```
<img src="https://s3.amazonaws.com/rfv2/row_slice.png" style="width: 350px;">

<br/>

## Exercise:

 - Load rows 2 to 6 of column Major_category and assign it to variable major_category.
 - print out the variable major_category

```python
# list rows 2:6 of Major_category
# note that the index starts with 0
# write your code below
```


### Solution

```python
major_category = grad_students['Major_category'][1:6]
major_category
```



<br/><br/><br/>
## Column Selection

To list multiple columns use double square brackets. For example, to list Major_category, Grad_share you can use:
```python
grad_students[['Major_category', 'Grad_share']]
```

<img src="https://s3.amazonaws.com/rfv2/col_slice.png" style="width: 350px;">

<br/>

### Exercise:

From the Graduate Students dataset (stored in grad_students):
 - List columns of Major_code and Grad_employed for rows 4 to 8 and assign it to the variable, grad_students_sample.
 - print out grad_students_sample

```python
grad_students[['Major_code', 'Grad_employed']]
```

<pre>
   Major_code  Grad_employed
0        5601           7098
1        6004          40492
2        6211          18368
3        2201           3590
4        2001           7512
5        3201           1008
6        6206         151570
7        1101          13104
8        2101           4716
9        1904          28517
</pre>


### Solution
```python
grad_students_sample = grad_students[['Major_code', 'Grad_employed']][3:8]
grad_students_sample
```



<br/><br/><br/>
## Assignments & Comparisons

### Adding a new column

Adding a new column with values is as easy as a variable assignment. To add a new column of percentage of employed grads:
```python
grad_students['emp_percent'] = (grad_students.Grad_employed/grad_students.Grad_total) * 100.0
```

### Renaming a Column

Renaming a column is easy by specifying a dictionary of old name as key and new name as value.
```python
grad_students = grad_students.rename(columns={'Major_category': 'Major_Category'})
```
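Both operations can be tried on a small dataframe first. The sketch below uses made-up column names ('employed', 'total'); it adds a computed column and then renames one of the original columns.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'employed': [80, 45], 'total': [100, 50]})

# New column computed element by element from two existing columns
df['emp_percent'] = df['employed'] / df['total'] * 100.0

# rename returns a new dataframe, so the result is assigned back
df = df.rename(columns={'total': 'Total'})

print(df)
```

Columns that are not listed in the dictionary passed to rename keep their names.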

### Comparisons

To determine those rows where sample size is greater than 200,
```python
grad_students[grad_students.Grad_sample_size > 200]
```

<br/>

## Exercise:

List the Major Category of all the data where the employment percentage is greater than 80%.

 - Assign it to the variable major_emp_data and print it out.

```python
grad_students['emp_percent'] = (grad_students.Grad_employed/grad_students.Grad_total) * 100.0
grad_students[grad_students.emp_percent > 80]

# use grad_students.Major_category
# Write your code below:
```

Unnamed: 0,Major_code,Major,Major_category,Grad_total,Grad_sample_size,Grad_employed,Grad_full_time_year_round,Grad_unemployed,Grad_unemployment_rate,Grad_median,...,Nongrad_employed,Nongrad_full_time_year_round,Nongrad_unemployed,Nongrad_unemployment_rate,Nongrad_median,Nongrad_P25,Nongrad_P75,Grad_share,Grad_premium,emp_percent
4,2001,COMMUNICATION TECHNOLOGIES,Computers & Mathematics,9109,171,7512,5622,466,0.058411,57000.0,...,43163,34231,3389,0.0728,52000.0,36000,78000.0,0.144753,0.096154,82.467889
8,2101,COMPUTER PROGRAMMING AND DATA PROCESSING,Computers & Mathematics,5611,98,4716,3981,119,0.024612,85000.0,...,22024,18381,2222,0.091644,60000.0,40000,85000.0,0.165394,0.416667,84.049189
9,1904,ADVERTISING AND PUBLIC RELATIONS,Communications & Journalism,33928,688,28517,22523,899,0.030562,60000.0,...,127832,100330,8706,0.063762,51000.0,37800,78000.0,0.171907,0.176471,84.051521
13,1903,MASS MEDIA,Communications & Journalism,42915,828,35939,28054,1957,0.051641,57000.0,...,153722,117581,12816,0.076955,50000.0,35000,72000.0,0.184236,0.14,83.744611
15,2107,COMPUTER NETWORKING AND TELECOMMUNICATIONS,Computers & Mathematics,11165,218,9037,7988,803,0.081606,80000.0,...,41552,34402,2476,0.056237,58000.0,37700,84000.0,0.186266,0.37931,80.940439
17,2599,MISCELLANEOUS ENGINEERING TECHNOLOGIES,Engineering,14816,315,12433,11146,407,0.031698,80000.0,...,50092,44199,3316,0.062088,65000.0,43000,90000.0,0.196533,0.230769,83.916037
20,5301,CRIMINAL JUSTICE AND FIRE PROTECTION,Law & Public Policy,188228,3794,154146,133549,6591,0.041005,68000.0,...,565748,490943,29133,0.048973,52000.0,36000,75000.0,0.213027,0.307692,81.893236
23,6212,MANAGEMENT INFORMATION SYSTEMS AND STATISTICS,Business,41970,963,36227,32121,1459,0.038715,89000.0,...,129179,115177,5690,0.042189,72000.0,50000,100000.0,0.218503,0.236111,86.316416
24,2106,COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY,Computers & Mathematics,10290,194,8554,7581,542,0.059587,81000.0,...,29740,26303,2102,0.066013,60000.0,41600,80000.0,0.222858,0.35,83.129252
25,6204,OPERATIONS LOGISTICS AND E-COMMERCE,Business,15056,335,12659,10861,296,0.022848,94000.0,...,43074,38454,1827,0.04069,66000.0,46800,92000.0,0.224385,0.424242,84.079437


### Solution

```python
major_emp_data = grad_students.Major_category[grad_students.emp_percent > 80]
major_emp_data
```



<br/><br/><br/>
## Statistical Computations

### Statistical Measures

Statistics such as the mean and median can be easily computed on selected values in the dataframe.

To compute the mean of Grad_employed:
```python
grad_students.Grad_employed.mean()
```
To calculate the sum of the Grad_employed column:
```python
grad_students.Grad_employed.sum()
```
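The same methods work on a whole dataframe, where they return one result per column. The sketch below uses a small made-up dataframe whose values are easy to check by hand.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'employed': [10, 20, 60], 'unemployed': [1, 2, 3]})

col_mean = df['employed'].mean()      # (10 + 20 + 60) / 3
col_median = df['employed'].median()  # middle value of 10, 20, 60
col_sums = df.sum()                   # one sum per column

print(col_mean, col_median)
print(col_sums)
```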

### Sparsity or Missing Values

<img src="../../../images/python_missing_values.png" style="width: 700px;">

In many real world datasets, it is common to encounter missing or null values. 

To find missing or null values in a column, you can use the isnull() function. For this, let us study the midwest survey dataset.
```python
midwest_survey['Which of the following states do you consider part of the Midwest? Please select all that apply. '].isnull()
```
To drop the null entries from the above column (this returns a new Series and leaves the dataframe unchanged):
```python
midwest_survey['Which of the following states do you consider part of the Midwest? Please select all that apply. '].dropna()
```

### Sparsity Substitution

When only a few values are missing, it is not worth throwing away the entire row, as we would lose a lot of useful information. We can fill such values with 'NA' and preserve the row for later analysis:
```python
midwest_survey['Which of the following states do you consider part of the Midwest? Please select all that apply. '].fillna(value='NA')
```
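The above statement does not change the original dataframe. To keep the result, assign it back to the column; calling fillna with inplace=True on a selected column may not update the dataframe in recent pandas versions:
```python
states_col = 'Which of the following states do you consider part of the Midwest? Please select all that apply. '
midwest_survey[states_col] = midwest_survey[states_col].fillna(value='NA')
```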
We can also substitute with another value:
```python
midwest_survey['Which of the following states do you consider part of the Midwest? Please select all that apply. '].fillna(value=0.5)
```
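The three missing-value methods can be compared side by side on a short series. The sketch below uses a made-up series `answers` with two missing entries; none of the calls modifies the original series.

```python
import numpy as np
import pandas as pd

# Hypothetical survey answers; np.nan marks a missing response
answers = pd.Series(['Ohio', np.nan, 'Iowa', np.nan])

missing = answers.isnull()     # True where the value is missing
kept = answers.dropna()        # new Series without the missing entries
filled = answers.fillna('NA')  # new Series with missing entries replaced

print(missing.sum())           # number of missing values
print(kept)
print(filled)
```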
To see statistical data of an entire dataframe:

```python
grad_students.describe(include='all')
```

<br/>

## Exercise:

Determine the Major for which the Grad_employed is maximum.

 - Assign it to variable grad_emp_max
 - print it out.

```python
# Use grad_students.Grad_employed.max()
# Use == comparator and find out which row/s contains the maximum value. Then print out the Major.
# write your code below:
```

### Solution

```python
grad_emp_max = grad_students[grad_students.Grad_employed == grad_students.Grad_employed.max()].Major
grad_emp_max
```




<br/><br/><br/>
## Advanced operations - Dataframe

Dataframe has advanced operations that can be used to filter data. To retrieve data about grad students whose major is engineering and employed percentage is above 80,
```python
grad_students[(grad_students.Major_category == 'Engineering') & (grad_students.emp_percent > 80)]
```
Similarly, you can use OR with the single symbol |. Note that pandas uses the single symbols & and | to combine conditions, not doubled operators such as && and || as in C, and that each condition must be wrapped in parentheses.
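A small made-up dataframe shows how & and | differ: & keeps rows that satisfy both conditions, while | keeps rows that satisfy at least one.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'category': ['Engineering', 'Arts', 'Engineering', 'Business'],
                   'emp_percent': [85, 90, 70, 60]})

is_engineering = df['category'] == 'Engineering'
high_employment = df['emp_percent'] > 80

both = df[is_engineering & high_employment]     # AND: only row 0
either = df[is_engineering | high_employment]   # OR: rows 0, 1 and 2

print(both)
print(either)
```

Storing each condition in its own variable, as done here, is one way to avoid the nested parentheses that a one-line version needs.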


## Exercise:

List the grad students whose Major is "COMMERCIAL ART AND GRAPHIC DESIGN" or "ENGINEERING" and whose unemployment rate is below 10%.

 - Assign the result to variable, grad_select_list
 - Print the value

```python
# Use | and & to build your command

# write your code below:
```


### Solution

```python
grad_select_list = grad_students[((grad_students.Major == 'ENGINEERING') | (grad_students.Major == 'COMMERCIAL ART AND GRAPHIC DESIGN')) & (grad_students.Grad_unemployment_rate < 0.1)]
grad_select_list
```



<br/><br/><br/>
## Index Based Selection

<img src="../../../images/python_index_based_selection.png" style="width: 700px;">

Indices can be used just as in arrays or lists when the selection is a contiguous sequence rather than multiple disconnected rows or columns. Use iloc for position-based selection (the older ix indexer has been removed from pandas). To extract rows 1 to 4 and columns 4 to 7, you can use:
```python
grad_students.iloc[0:4, 3:7]
```

## Exercise:

To select all the data in the row or column, you don't need to specify the start and end index numbers, but just the colon. 

 - Filter all rows for columns 2 to 6 and assign to variable grad_row_sample.
 - Print out the first 5 rows

```python
# Use : in the row position to select all rows
# Write your code below
```



### Solution

```python
grad_row_sample = grad_students.iloc[:, 1:6]
grad_row_sample.head()
```



<br/><br/><br/>
## Sorting & Grouping

### Sorting

To sort a dataframe by a column, you can use the sort_values method and specify the column. To sort the grad student dataframe by sample size in ascending order:
```python
grad_students.sort_values(by='Grad_sample_size')
```
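sort_values sorts in ascending order by default; passing ascending=False reverses the order. The sketch below shows both on a small made-up dataframe.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'Major': ['Math', 'Art', 'Physics'],
                   'Grad_sample_size': [300, 120, 210]})

smallest_first = df.sort_values(by='Grad_sample_size')
largest_first = df.sort_values(by='Grad_sample_size', ascending=False)

print(smallest_first)
print(largest_first)
```

Both calls return a new dataframe; the rows keep their original index labels, so the index appears out of order after sorting.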

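### Grouping

<img src="../../../images/groupbyPython-1.png">

You can use the groupby method to group rows by the values of a column. The result is a grouped object; nothing is computed until you apply an aggregation such as mean() or sum() to it.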
```python
grad_students.groupby('Major_category')
```
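On its own, groupby only records how the rows are split. The sketch below, on a small made-up dataframe, follows the grouping with an aggregation to get one value per group.

```python
import pandas as pd

# Hypothetical data: two rows per category
df = pd.DataFrame({'category': ['Arts', 'Arts', 'Business', 'Business'],
                   'employed': [10, 30, 50, 70]})

grouped = df.groupby('category')          # rows split by category, nothing computed yet
mean_by_cat = grouped['employed'].mean()  # one mean per category
sum_by_cat = grouped['employed'].sum()    # one sum per category

print(mean_by_cat)
print(sum_by_cat)
```

The result is a Series indexed by the group labels, so a single group can be read with, for example, mean_by_cat['Arts'].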
<br/>

## Exercise:

You can use max() function to determine the max of any column. For example to determine the max employed graduates:
```python
grad_students.Grad_employed.max()
```
Can you determine what is the max graduates employed in each Major category?

 - Assign the result to the variable, grad_cat_max and print it out.

```python
# Modify the code below, Use groupby before using the max function.
grad_cat_max = ''
```



### Solution

```python
grad_cat_max = grad_students.groupby('Major_category').Grad_employed.max()
grad_cat_max
```

# Reshaping

### Pivot 

The pivot method reshapes a dataset into a new table, using the values of one column as the new row index and the values of another as the new column labels. Pivot methods are very useful for summarizing the data in a dataset, because they let us group or organize the existing data as we wish. Aggregations such as mean or count are performed by pivot_table, described in the next section, not by pivot itself.

The pivot method takes index, columns, and values as arguments. It does not aggregate, so it raises an error if the same index and column combination appears more than once.
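As a minimal sketch of pivot, the example below turns a made-up "long" table of temperature readings into a "wide" table with one row per day and one column per city. Every (date, city) pair occurs only once, which is what pivot requires.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, city) reading
long_df = pd.DataFrame({'date': ['Mon', 'Mon', 'Tue', 'Tue'],
                        'city': ['Austin', 'Boston', 'Austin', 'Boston'],
                        'temp': [30, 20, 32, 21]})

# Rows come from 'date', columns from 'city', cell values from 'temp'
wide = long_df.pivot(index='date', columns='city', values='temp')

print(wide)
```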


### Pivot Table

The pivot_table method overcomes the main limitation of the pivot method, which cannot handle duplicate entries. It plays the same role as pivot, but when several rows share the same values for the specified columns, it combines them using an aggregation function (the mean by default).



```python
import pandas as pd
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
        'TestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'TestScore'])
df
```

Output:

<pre>
    regiment    company TestScore
0   Nighthawks  1st     4
1   Nighthawks  1st     24
2   Nighthawks  2nd     31
3   Nighthawks  2nd     2
4   Dragoons    1st     3
5   Dragoons    1st     4
6   Dragoons    2nd     24
7   Dragoons    2nd     31
8   Scouts      1st     2
9   Scouts      1st     3
10  Scouts      2nd     2
11  Scouts      2nd     3
</pre>

A pivot table of group means, by regiment and company, can be created with:

```python
pd.pivot_table(df, index=['regiment', 'company'], aggfunc='mean')
```

The pivot_table method is a powerful tool for analysis, but to use it well one first needs to understand the data properly.


### Stack/Unstack

Pivoting a table is a special case of stacking a DataFrame. Suppose we have a DataFrame with MultiIndexes on the rows and columns. Stacking a DataFrame means moving (also called rotating or pivoting) the innermost column index to become the innermost row index. The inverse operation is called unstacking: it moves the innermost row index to become the innermost column index.
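Stacking and unstacking can be sketched on a dataframe with ordinary, single-level indexes, where the only column level is the innermost one. The dataframe `scores` below is made up for illustration.

```python
import pandas as pd

# Hypothetical data: one row per student, one column per subject
scores = pd.DataFrame({'math': [90, 80], 'art': [70, 60]},
                      index=['Ann', 'Bob'])

# stack: the column labels become the inner level of the row index
stacked = scores.stack()

# unstack: the inner row level moves back to the columns
unstacked = stacked.unstack()

print(stacked)
print(unstacked)
```

After stacking, each value is addressed by a (student, subject) pair, for example stacked[('Ann', 'art')].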

## Exercise

Print out a pivot table on grad_students for a group max by Grad_unemployed and Grad_employed and store it in a variable mean_pivot

```python
# write your code below:
```


### Solution

```python
mean_pivot = pd.pivot_table(grad_students, index=['Grad_employed','Grad_unemployed'], aggfunc='max')
mean_pivot
```