# Intro to Pandas
**Pandas** is an open-source library that provides easy-to-use data structures and data analysis tools for Python. It is an essential tool for every data scientist working in Python. Pandas can import data from various file formats, and it offers two useful labeled data structures: the Series and the DataFrame.<br>


Ref: http://pandas.pydata.org<br>

In this micro course, we'll learn how to use pandas. First, let's start with the concepts of "**Series**" and "**Dataframes**".

## Series

A Series is a one-dimensional labeled array: each element has a label, and together the labels form the index. A Series can hold any type of data, such as integers, floats, strings, or arbitrary Python objects, and it can be built from a list, a dict, a scalar value, or a NumPy ndarray.

The syntax for creating a series is:
```python
s = pd.Series(data, index=index)
```
An example is shown below on how to use the series in a program.

```python

import pandas as pd

# Avoid naming the variable `list`, which would shadow the Python built-in
numbers = pd.Series([2, 3, 4, 5, 6, 7])
print(numbers)

# Output:
# 0    2
# 1    3
# 2    4
# 3    5
# 4    6
# 5    7
# dtype: int64

```

An element can be accessed by its index label or by its integer position.
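As a small illustration of element access, the sketch below builds a Series with custom string labels and reads the same element three ways. The data and the names `counts`, `by_label`, `by_position`, and `by_loc` are made up for this example.

```python
import pandas as pd

# Hypothetical data: five values labelled with custom string indexes
counts = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

by_label = counts["b"]          # label-based lookup
by_position = counts.iloc[1]    # position-based lookup (positions start at 0)
by_loc = counts.loc["b"]        # explicit label-based lookup

print(by_label, by_position, by_loc)
```

Using `.loc` for labels and `.iloc` for positions makes the intent explicit, which helps when the labels are themselves integers.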

## Exercise

Assign apple, ball, cat, dog, elephant to a variable 'alphabet', and then access the element ball using its index.
    

In [1]:
import pandas as pd
import numpy as np
#write your code below


## Dataframes

According to pydata.org, a DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet, a SQL table, or a dict of Series objects.

<img src="dataframe.png" style="width: 300px"> 

To use pandas, first import the library; its functions are then available through the `pd` alias:

```python
import pandas as pd
import numpy as np
pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```
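To make the "dict of Series" description concrete, here is a minimal sketch that builds a DataFrame from two Series. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical columns; each Series becomes one column of the DataFrame
names = pd.Series(["Ann", "Bob", "Cy"])
ages = pd.Series([31, 25, 42])

people = pd.DataFrame({"name": names, "age": ages})

print(people.shape)    # (number of rows, number of columns)
print(people.dtypes)   # each column keeps its own type
```

The dict keys become the column labels, and the shared Series index becomes the row index.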

### Description

You can think of Dataframes as being as important to a data scientist as cutting is to a cook. Learning their various operations is essential to becoming a strong data scientist.

To understand how dataframes work, let us consider the datasets used by [FiveThirtyEight.com](https://FiveThirtyEight.com) to analyze the earning economics of college majors and to conduct a Midwest states survey.

In these examples, we read a csv file into a dataframe object. Pandas is versatile enough to read files directly from URLs.

### Reading into a Dataframe

Dataframes are part of the pandas library. One of its versatile features is that the read_csv() function can load data directly from a URL: you only need to pass it the link.

Two fundamental methods for inspecting a dataframe are '**head**', which lists the first few rows in top-down order, and '**tail**', which lists the last few rows. Both take the number of rows to list (default is 5). <br>
 - List first five rows of the grad_students dataframe using the 'head' function.
 - List last five rows of the grad_students dataframe using the 'tail' function.

In [2]:
# Import the library
import pandas as pd
import numpy as np

grad_url = "https://raw.githubusercontent.com/colaberry/538data/master/college-majors/grad-students.csv"
grad_students = pd.read_csv(grad_url)

#Use .head() and .tail() functions
#write your code here


### Using head() or tail() function with a parameter

We can pass the number of rows as a parameter to the head() or tail() function.
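A minimal sketch of passing a row count, using a small made-up dataframe (`df` and its columns are hypothetical stand-ins for a real dataset):

```python
import pandas as pd

# Hypothetical 10-row frame standing in for a real dataset
df = pd.DataFrame({"n": range(10), "square": [i * i for i in range(10)]})

first_three = df.head(3)   # rows at positions 0, 1, 2
last_two = df.tail(2)      # rows at positions 8, 9

print(first_three)
print(last_two)
```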

## Exercise:

- Now, try listing first 7 rows of the midwest_survey dataframe using the head() function and assign it to midwest_sample_head variable.
- List last 10 rows of the midwest_survey dataframe using the tail() function and assign it to midwest_sample_tail variable.

In [3]:
# Import the library
import pandas as pd
import numpy as np

midwest_survey_url = "https://raw.githubusercontent.com/colaberry/538data/master/region-survey/MIDWEST.csv"
midwest_survey = pd.read_csv(midwest_survey_url)

#Insert values in .head() function by passing the number of rows.
#write your code below


### Writing a Dataframe into a csv file

To save a dataframe to a csv file, use 'to_csv'.

```python
df.to_csv('filename.csv')
```    
         
where df is the name of the dataframe and filename.csv is the name you want to give your csv file.

to_csv accepts many optional arguments. For example, 'sep' sets the field delimiter:

```python
df.to_csv(file_name, sep=',')
```  
By default, to_csv writes the row index as the first column of the file. To keep the index out of the csv file, pass 'False' to the 'index' argument.

Here we also set the "encoding" parameter to "utf-8", which is the default in Python 3. "utf-8" is a variable-width encoding for Unicode that can represent text in any language: it uses one to four 8-bit bytes per Unicode code point. "ASCII", the default in Python 2, uses only 7 bits per character and cannot represent most special characters.

```python
df.to_csv(file_name, encoding='utf-8', index=False)
```
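To see the effect of the 'index' argument without touching the disk, the sketch below relies on the fact that to_csv returns the CSV text as a string when no file name is given. The dataframe contents are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical data, including a non-ASCII character
df = pd.DataFrame({"city": ["Zürich", "Lima"], "pop_millions": [0.4, 10.0]})

# With no path given, to_csv returns the CSV text as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])      # header starts with an empty index column
print(without_index.splitlines()[0])   # header holds only the column names

# Reading the index-free text back gives the original columns
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip)
```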

## Exercise

What is the syntax for writing a dataframe 'grad_students' to a csv file 'gradstudents.csv' without any index values stored in the csv file?


In [4]:
import pandas as pd
import numpy as np
grad_url = "https://raw.githubusercontent.com/colaberry/538data/master/college-majors/grad-students.csv"
grad_students = pd.read_csv(grad_url)
#write your code below



## Index Based Selection

<img src="python_index_based_selection.png" style="width:35vw">
Index ranges can be used just as with arrays or lists when the selection is a contiguous block rather than multiple disconnected rows/columns. Positions start at 0 and the upper bound is excluded, so to extract rows 1 to 3 and columns 4 to 7 (counting from 1), you can use:

```python
grad_students.iloc[0:3, 3:7]
```
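The same kind of slice can be checked on a small grid where every value reveals its own position. This is only a sketch: `grid` and its column letters are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 5x8 grid holding the numbers 0..39 row by row
grid = pd.DataFrame(np.arange(40).reshape(5, 8), columns=list("ABCDEFGH"))

block = grid.iloc[0:3, 3:7]    # row positions 0-2, column positions 3-6
all_rows = grid.iloc[:, 0:2]   # a bare colon selects every row

print(block)
print(all_rows.shape)
```

Because the upper bound is excluded, `0:3` yields three rows and `3:7` yields four columns.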

## Exercise:

To select all the data in the row or column, you don't need to specify the start and end index numbers, but just the colon. 

 - Filter all rows for columns 2 to 6 and assign to variable grad_row_sample.
 - Print out the first 5 rows

In [16]:
#Use : in the column space
# Write your code below



## Row Selection

The output of head() showed the row and column labels. Like arrays, dataframes are indexed from 0 on both rows and columns, and the upper bound of a slice is excluded from the output. To list rows 2 to 4 of the column 'Major':

```python
grad_students['Major'][1:4]
```
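For comparison, here is a hedged sketch on a made-up dataframe (`letters` is hypothetical) that contrasts the plain slice with `.iloc` and `.loc`. Note that label-based slicing with `.loc` includes the end label, unlike positional slicing.

```python
import pandas as pd

# Hypothetical single-column frame with the default 0..5 index
letters = pd.DataFrame({"letter": list("abcdef")})

positional = letters["letter"][1:4]        # positions 1, 2, 3
explicit = letters["letter"].iloc[1:4]     # same rows, stated explicitly
label_based = letters.loc[1:4, "letter"]   # labels 1 through 4, end included

print(list(positional))
print(list(label_based))
```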

<img src="row_slice.png" style="width:30vw">

## Exercise:

 - Load rows 2 to 6 of column Major_category and assign it to variable major_category.
 - print out the variable major_category

In [6]:
# list rows 2:6 of Major_category
#note that the index starts with 0
#write your code below



## Column Selection

To list multiple columns, use double square brackets. For example, to list Major_category and Grad_share, you can use:
```python
grad_students[['Major_category', 'Grad_share']]
```
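The difference between single and double brackets shows up in the type of the result. A minimal sketch with invented column names:

```python
import pandas as pd

# Hypothetical three-column frame
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

one_col = df["a"]            # single brackets return a Series
two_cols = df[["a", "c"]]    # a list of names returns a DataFrame
one_col_frame = df[["a"]]    # a one-item list still returns a DataFrame

print(type(one_col).__name__, type(two_cols).__name__, type(one_col_frame).__name__)
```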

<img src="col_slice.png" style="width:25vw">

### Exercise:

From the Graduate Students dataset (stored in grad_students):
 - List columns of Major_code and Grad_employed for rows 4 to 8 and assign it to the variable, grad_students_sample.
 - print out grad_students_sample

In [7]:
# Write your code below:


## Assignments & Comparisons

### Adding a new column

Adding a new column with values is as easy as a variable assignment. To add a new column of percentage of employed grads:
```python
grad_students['emp_percent'] = (grad_students.Grad_employed/grad_students.Grad_total) * 100.0
```

### Renaming a Column

To rename a column, pass rename() a dictionary with the old name as the key and the new name as the value.
```python
grad_students = grad_students.rename(columns={'Major_category': 'Major_Category'})
```

### Comparisons

To select the rows where the sample size is greater than 200:
```python
grad_students[grad_students.Grad_sample_size > 200]
```
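A comparison on a column produces a boolean Series, and indexing the dataframe with that Series keeps only the rows marked True. The sketch below walks through this on invented numbers; the column names are hypothetical and only loosely modelled on the dataset.

```python
import pandas as pd

# Hypothetical totals and employed counts for three majors
df = pd.DataFrame({"total": [100, 200, 400], "employed": [90, 120, 380]})

df["emp_percent"] = df["employed"] / df["total"] * 100.0   # new column
df = df.rename(columns={"total": "grad_total"})            # rename returns a new frame

mask = df["emp_percent"] > 80   # boolean Series, one entry per row
high = df[mask]                 # keeps only the rows where mask is True

print(mask)
print(high)
```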

## Exercise:

List the Major Category of all the data where the employment percentage is greater than 80%.

 - Assign it to the variable major_emp_data and print it out.

In [8]:
import pandas as pd
import numpy as np
grad_url = "https://raw.githubusercontent.com/colaberry/538data/master/college-majors/grad-students.csv"
grad_students = pd.read_csv(grad_url)

grad_students['emp_percent'] = (grad_students.Grad_employed/grad_students.Grad_total) * 100.0
grad_students
#use grad_students.Major_category
#Write your code below:


| | Major_code | Major | Major_category | Grad_total | Grad_sample_size | Grad_employed | Grad_full_time_year_round | Grad_unemployed | Grad_unemployment_rate | Grad_median | ... | Nongrad_employed | Nongrad_full_time_year_round | Nongrad_unemployed | Nongrad_unemployment_rate | Nongrad_median | Nongrad_P25 | Nongrad_P75 | Grad_share | Grad_premium | emp_percent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5601 | CONSTRUCTION SERVICES | Industrial Arts & Consumer Services | 9173 | 200 | 7098 | 6511 | 681 | 0.087543 | 75000.0 | ... | 73607 | 62435 | 3928 | 0.050661 | 65000.0 | 47000 | 98000.0 | 0.096320 | 0.153846 | 77.379265 |
| 1 | 6004 | COMMERCIAL ART AND GRAPHIC DESIGN | Arts | 53864 | 882 | 40492 | 29553 | 2482 | 0.057756 | 60000.0 | ... | 347166 | 250596 | 25484 | 0.068386 | 48000.0 | 34000 | 71000.0 | 0.104420 | 0.250000 | 75.174514 |
| 2 | 6211 | HOSPITALITY MANAGEMENT | Business | 24417 | 437 | 18368 | 14784 | 1465 | 0.073867 | 65000.0 | ... | 145597 | 113579 | 7409 | 0.048423 | 50000.0 | 35000 | 75000.0 | 0.119837 | 0.300000 | 75.226277 |
| 3 | 2201 | COSMETOLOGY SERVICES AND CULINARY ARTS | Industrial Arts & Consumer Services | 5411 | 72 | 3590 | 2701 | 316 | 0.080901 | 47000.0 | ... | 29738 | 23249 | 1661 | 0.052900 | 41600.0 | 29000 | 60000.0 | 0.125878 | 0.129808 | 66.346332 |
| 4 | 2001 | COMMUNICATION TECHNOLOGIES | Computers & Mathematics | 9109 | 171 | 7512 | 5622 | 466 | 0.058411 | 57000.0 | ... | 43163 | 34231 | 3389 | 0.072800 | 52000.0 | 36000 | 78000.0 | 0.144753 | 0.096154 | 82.467889 |
| 5 | 3201 | COURT REPORTING | Law & Public Policy | 1542 | 22 | 1008 | 860 | 0 | 0.000000 | 75000.0 | ... | 6967 | 6063 | 518 | 0.069205 | 50000.0 | 34000 | 75000.0 | 0.147376 | 0.500000 | 65.369650 |
| 6 | 6206 | MARKETING AND MARKETING RESEARCH | Business | 190996 | 3738 | 151570 | 123045 | 8324 | 0.052059 | 80000.0 | ... | 817906 | 662346 | 45519 | 0.052719 | 60000.0 | 40000 | 91500.0 | 0.156531 | 0.333333 | 79.357683 |
| 7 | 1101 | AGRICULTURE PRODUCTION AND MANAGEMENT | Agriculture & Natural Resources | 17488 | 386 | 13104 | 11207 | 473 | 0.034838 | 67000.0 | ... | 71781 | 61335 | 1869 | 0.025377 | 55000.0 | 38000 | 80000.0 | 0.163965 | 0.218182 | 74.931382 |
| 8 | 2101 | COMPUTER PROGRAMMING AND DATA PROCESSING | Computers & Mathematics | 5611 | 98 | 4716 | 3981 | 119 | 0.024612 | 85000.0 | ... | 22024 | 18381 | 2222 | 0.091644 | 60000.0 | 40000 | 85000.0 | 0.165394 | 0.416667 | 84.049189 |
| 9 | 1904 | ADVERTISING AND PUBLIC RELATIONS | Communications & Journalism | 33928 | 688 | 28517 | 22523 | 899 | 0.030562 | 60000.0 | ... | 127832 | 100330 | 8706 | 0.063762 | 51000.0 | 37800 | 78000.0 | 0.171907 | 0.176471 | 84.051521 |


## Filtering

Dataframes support more advanced filtering by combining conditions. To retrieve data about grad students whose major category is Engineering and whose employment percentage is above 80:
```python
grad_students[(grad_students.Major_category == 'Engineering') & (grad_students.emp_percent > 80)]
```
Similarly, you can express OR with the single symbol |. Note that pandas uses the single-character operators & and | rather than the doubled && and || found in languages such as C, and each condition must be wrapped in parentheses.
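The sketch below combines two conditions with `&`, `|`, and the negation operator `~` on a small made-up dataframe (the rows and column names are hypothetical).

```python
import pandas as pd

# Hypothetical categories and employment percentages
df = pd.DataFrame({
    "category": ["Engineering", "Arts", "Business", "Engineering"],
    "emp_percent": [85.0, 70.0, 90.0, 60.0],
})

# Parentheses are required because & and | bind tighter than == and >
both = df[(df["category"] == "Engineering") & (df["emp_percent"] > 80)]
either = df[(df["category"] == "Engineering") | (df["emp_percent"] > 80)]
not_eng = df[~(df["category"] == "Engineering")]   # ~ negates a mask

print(both)
print(either)
print(not_eng)
```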


## Exercise:

List the grad students whose major is "Art or Graphic Design" or "Engineering" and whose unemployment rate is below 10%.

 - Assign the result to variable, grad_select_list
 - Print the value

In [9]:
# Use | and & to build your command
#write your code below:



## Sorting & Grouping

### Sorting

To sort a dataframe by a column, use the sort_values method and specify the column. To sort the grad student dataframe by "Grad_sample_size" in ascending order:
```python
grad_students.sort_values(by='Grad_sample_size')
# If you want to sort the dataframe by two or more columns, you can define the orders for each column. Here is the example:
grad_students.sort_values(by=['Grad_sample_size', 'Grad_employed'], ascending=[True, False])
# This will give you the sorted dataframe with ascending 'Grad_sample_size', then descending 'Grad_employed'. 

```
The sort_values() function above sorts a dataframe by the values in the given columns, whereas sort_index() sorts by the labels along the given axis. The difference between the two is that sort_index() operates on the axis labels rather than the actual data.

For example, we can randomly select 10 records from grad_students, then sort them by row index or by column labels.
```python
grad_students_sample = grad_students.sample(10)
grad_students_sample.sort_index(axis=0)     # sort by index
grad_students_sample.sort_index(axis=1)     # sort by column labels
```
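The contrast between sorting by values and sorting by labels is easier to see on a tiny frame whose index is deliberately out of order. This is a sketch with invented data and names.

```python
import pandas as pd

# Hypothetical frame with a shuffled row index
df = pd.DataFrame({"size": [30, 10, 20], "name": ["c", "a", "b"]},
                  index=[0, 2, 1])

by_value = df.sort_values(by="size")      # orders rows by the data in 'size'
by_label = df.sort_index()                # orders rows by index label
by_column_label = df.sort_index(axis=1)   # orders columns by their names

print(by_value)
print(by_label)
print(by_column_label)
```

Each call returns a new dataframe and leaves `df` unchanged.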
### Grouping

Mastering the pandas groupby() functionality puts further power in your hands. Groupby essentially splits the data into different groups depending on a variable of your choice. For example, for a hypothetical dataframe `data` with a `month` column, the expression data.groupby('month') splits the rows by month.

<img src="groupbyPython.png" style="width: 35vw">
You can use the groupby() method to group rows by the values in a column.

Example 1:
```python
grad_students.groupby('Major_category')
```
Example 2:
```python
# If you want to group by two or more columns, you can simply use a list:
grad_students.groupby(['Major_category','Grad_median'])
```
The groupby() function returns a GroupBy object that splits the data into groups according to the variables you choose. The first example above splits the dataframe grad_students by 'Major_category'. The second example splits it by the two columns 'Major_category' and 'Grad_median' together.

The .groups attribute of a GroupBy object returns a dictionary whose keys are the computed unique groups and whose values are the axis labels (here, index numbers) belonging to each group.
```python
grad_students.groupby('Major_category').groups    # returns the dictionary
grad_students.groupby('Major_category').groups.keys()    # returns the keys in the above dictionary
grad_students.groupby('Major_category').groups.values()    # returns the values in the above dictionary
```
Functions such as first() and last() can be applied to the GroupBy object to obtain the first or last entry of each group. For example:
```python
grad_students.groupby('Major_category').first()    # This gives the first entry for each "Major_category".
grad_students.groupby(['Major_category', 'Grad_median']).last()     # This gives the last entry for each "Major_category" and "Grad_median".
```
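To see what these calls return, the sketch below groups a small invented dataframe and inspects the groups, the first row of each group, and one aggregate per group. The data and names are hypothetical, and `mean()` is used here only as one example of an aggregation.

```python
import pandas as pd

# Hypothetical categories and employed counts
df = pd.DataFrame({
    "category": ["Arts", "Business", "Arts", "Business", "Arts"],
    "employed": [10, 40, 30, 20, 80],
})

grouped = df.groupby("category")

# Convert the groups mapping to plain lists of row labels for easy reading
group_labels = {key: list(labels) for key, labels in grouped.groups.items()}

first_rows = grouped.first()             # first entry of each group
averages = grouped["employed"].mean()    # one aggregate value per group

print(group_labels)
print(first_rows)
print(averages)
```

The result of an aggregation is indexed by the group keys, so individual groups can be looked up by name.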

## Exercise:

You can use the max() function to determine the maximum of any column. For example, to determine the maximum number of employed graduates:
```python
grad_students.Grad_employed.max()
```
Can you determine the maximum number of graduates employed in each Major category?<br> 
How would you get the total number of graduates employed for each Major category with the same "Grad_median"?

 - Assign the first result to the variable, grad_cat_max and print it out.
 - Assign the second result to the variable, grad_cat_sum.

In [10]:
# Modify the code below, Use groupby before using the max function.
grad_cat_max = ''
grad_cat_sum = ''


## Learn more about Pandas

In this notebook, we have given you a brief introduction to Pandas. Pandas has many more functions that can make your data manipulation easier, and this notebook is meant as a starting point. If you want to become proficient with Pandas and work as a data scientist in the future, check out our course at https://refactored.ai. Our course on Python covers everything from introductory Python to pandas, data visualization, statistics, and machine learning techniques.