# Pandas

Pandas is a popular Python library used for managing data in the form of DataFrames

****

## Creating or loading DFs

In [1]:
# import pandas module

import pandas as pd

Use `pd.DataFrame()` to create a DataFrame by passing it a `dict` object. Each key becomes a column name, and the list assigned to that key supplies the values for that column.

In [2]:
# Create a DataFrame object

family = pd.DataFrame({
    'name': ['Carlota', 'Gonzalo', 'María', 'Juan'], 
    'member': ['sister', 'brother', 'mom', 'dad'],
    'age': [22, 19, 50, 51]})

print(family)

      name   member  age
0  Carlota   sister   22
1  Gonzalo  brother   19
2    María      mom   50
3     Juan      dad   51


We can also create a DataFrame from a list of lists, where each inner list is one row and the `columns` argument supplies the column names

In [3]:
cats = pd.DataFrame([
    ['Midnight', 'black', 2],
    ['Sugarcane', 'tabby', 3],
    ['Moon', 'white', 1],
    ],
    columns=['name', 'color', 'age'])

print(cats)

        name  color  age
0   Midnight  black    2
1  Sugarcane  tabby    3
2       Moon  white    1


Alternatively, we can read a `csv` file using `pd.read_csv('filename.csv')`

In [4]:
grades = pd.read_csv('NY_grades.csv')

print(grades.head())

  Grade  Year Category  Number Tested  Mean Scale Score  Level 1 #  Level 1 %  \
0     3  2006    Asian           9768               700        243        2.5   
1     4  2006    Asian           9973               699        294        2.9   
2     5  2006    Asian           9852               691        369        3.7   
3     6  2006    Asian           9606               682        452        4.7   
4     7  2006    Asian           9433               671        521        5.5   

   Level 2 #  Level 2 %  Level 3 #  Level 3 %  Level 4 #  Level 4 %  \
0        543        5.6       4128       42.3       4854       49.7   
1        600        6.0       4245       42.6       4834       48.5   
2        907        9.2       4379       44.4       4197       42.6   
3       1176       12.2       4646       48.4       3332       34.7   
4       1698       18.0       4690       49.7       2524       26.8   

   Level 3+4 #  Level 3+4 %  
0         8982         92.0  
1         9079         91.0  
2         8576         87.0  
3         7978         83.1  
4         7214         76.5  
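`NY_grades.csv` may not be at hand, so here is a minimal sketch of the same call on an in-memory CSV. The `csv_text` contents and the variable names are made up for illustration; `pd.read_csv` accepts a file-like object as well as a path, and `usecols` limits which columns are parsed.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """name,color,age
Midnight,black,2
Sugarcane,tabby,3
Moon,white,1
"""

# read_csv accepts a path or any file-like object.
cats_from_csv = pd.read_csv(io.StringIO(csv_text))

# usecols keeps only the listed columns while parsing.
names_and_ages = pd.read_csv(io.StringIO(csv_text), usecols=["name", "age"])

print(cats_from_csv.shape)
print(list(names_and_ages.columns))
```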

We can also save data to a `csv` file using `.to_csv()` on a DataFrame object

In [5]:
# cats.to_csv('cats.csv')
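By default `.to_csv()` also writes the row index as an unnamed first column, which comes back as an `Unnamed: 0` column when the file is read again. A small sketch of a round trip that avoids this with `index=False`; the temporary directory is used only so the example does not leave files behind.

```python
import os
import tempfile

import pandas as pd

cats = pd.DataFrame(
    [["Midnight", "black", 2], ["Sugarcane", "tabby", 3], ["Moon", "white", 1]],
    columns=["name", "color", "age"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cats.csv")
    # index=False leaves the 0, 1, 2 row labels out of the file.
    cats.to_csv(path, index=False)
    cats_reloaded = pd.read_csv(path)

print(list(cats_reloaded.columns))
```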

Use `.info()` to get information on the data and `.describe()` to get some summary statistics

In [11]:
grades.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168 entries, 0 to 167
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Grade             168 non-null    object 
 1   Year              168 non-null    int64  
 2   Category          168 non-null    object 
 3   Number Tested     168 non-null    int64  
 4   Mean Scale Score  168 non-null    int64  
 5   Level 1 #         168 non-null    int64  
 6   Level 1 %         168 non-null    float64
 7   Level 2 #         168 non-null    int64  
 8   Level 2 %         168 non-null    float64
 9   Level 3 #         168 non-null    int64  
 10  Level 3 %         168 non-null    float64
 11  Level 4 #         168 non-null    int64  
 12  Level 4 %         168 non-null    float64
 13  Level 3+4 #       168 non-null    int64  
 14  Level 3+4 %       168 non-null    float64
dtypes: float64(5), int64(8), object(2)
memory usage: 19.8+ KB


In [9]:
grades.describe()

| | Year | Number Tested | Mean Scale Score | Level 1 # | Level 1 % | Level 2 # | Level 2 % | Level 3 # | Level 3 % | Level 4 # | Level 4 % | Level 3+4 # | Level 3+4 % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 168.0 | 168.0 | 168.0 | 168.0 | 168.0 | 168.0 | 168.0 | 168.0 | 168.0 | 168.0 | 168.0 | 168.0 | 168.0 |
| mean | 2008.5 | 30543.142857 | 678.458333 | 2876.238095 | 7.532143 | 7855.916667 | 21.653571 | 13487.130952 | 43.790476 | 6323.857143 | 27.016667 | 19810.988095 | 70.810714 |
| std | 1.712931 | 36902.292411 | 19.745038 | 5050.573585 | 6.23769 | 12254.927374 | 13.154316 | 17197.674686 | 10.131029 | 7023.296197 | 16.990838 | 22575.959582 | 18.705259 |
| min | 2006.0 | 9433.0 | 628.0 | 43.0 | 0.4 | 216.0 | 2.0 | 2762.0 | 24.3 | 605.0 | 2.3 | 6491.0 | 27.3 |
| 25% | 2007.0 | 10200.75 | 664.0 | 333.0 | 2.875 | 1384.25 | 11.775 | 4617.5 | 36.275 | 2698.25 | 12.45 | 8706.25 | 54.225 |
| 50% | 2008.5 | 21126.5 | 677.5 | 1346.5 | 5.15 | 3992.5 | 18.6 | 7422.5 | 42.95 | 4177.0 | 22.5 | 9975.5 | 76.25 |
| 75% | 2010.0 | 28592.5 | 695.0 | 3270.75 | 11.775 | 9313.0 | 32.4 | 14053.0 | 50.825 | 5326.0 | 42.6 | 17366.0 | 86.05 |
| max | 2011.0 | 177382.0 | 716.0 | 33091.0 | 31.1 | 70036.0 | 49.6 | 102188.0 | 71.8 | 33594.0 | 64.0 | 132637.0 | 97.6 |
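The `Grade` and `Category` columns are missing from the summary because `.describe()` only summarizes numeric columns by default. A minimal sketch on the small `cats` table (rebuilt here so the snippet runs on its own) shows the same behaviour:

```python
import pandas as pd

cats = pd.DataFrame(
    [["Midnight", "black", 2], ["Sugarcane", "tabby", 3], ["Moon", "white", 1]],
    columns=["name", "color", "age"],
)

# Only the numeric column 'age' is summarized; 'name' and 'color' are skipped.
summary = cats.describe()

print(list(summary.columns))
print(summary.loc["mean", "age"])
```

The result is itself a DataFrame, so individual statistics can be looked up with `.loc[row_label, column_name]`.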


****

## Manipulating DFs

### Select columns

You can select a particular column as you would select a value in a dictionary: `df['col_name']`. Attribute access, `df.col_name`, also works when the column name is a valid Python identifier.

Selecting a single column returns a `Series` object (instead of a `DataFrame`)

In [14]:
cats['name']

0     Midnight
1    Sugarcane
2         Moon
Name: name, dtype: object

In [13]:
cats.name

0     Midnight
1    Sugarcane
2         Moon
Name: name, dtype: object
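The two forms are not always interchangeable. The sketch below uses a made-up two-row frame (`scores` and `counts` are hypothetical names) to show that a column name containing a space needs brackets, and that a column named like an existing DataFrame method is shadowed by that method under attribute access.

```python
import pandas as pd

# Hypothetical frame that borrows two column names from the grades data.
scores = pd.DataFrame({"Year": [2006, 2007], "Number Tested": [9768, 9750]})

# Both forms return the same Series for a simple column name.
same = scores["Year"].equals(scores.Year)

# A name containing a space is only reachable with brackets.
tested = scores["Number Tested"]

# A column called 'count' clashes with the DataFrame.count method.
counts = pd.DataFrame({"count": [1, 2]})
attribute_is_method = callable(counts.count)   # the method, not the column
column_values = counts["count"].tolist()       # brackets reach the column

print(same, tested.tolist(), attribute_is_method, column_values)
```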

Use a list of column names to select multiple columns at once

In [15]:
cats[['name', 'age']]

| | name | age |
|---|---|---|
| 0 | Midnight | 2 |
| 1 | Sugarcane | 3 |
| 2 | Moon | 1 |


### Select rows

To select a single row by position, use the `.iloc[row_num]` indexer. Just like selecting a single column, this returns a `Series` object.

**NOTE**: `.iloc` positions are _zero_ indexed, meaning the first row is at position `0`

In [24]:
cats.iloc[2]

name      Moon
color    white
age          1
Name: 2, dtype: object

Use `.iloc[start:finish]` to select multiple rows

**NOTE**: start is _inclusive_ but finish is _not_ inclusive

In [25]:
cats.iloc[0:2]

| | name | color | age |
|---|---|---|---|
| 0 | Midnight | black | 2 |
| 1 | Sugarcane | tabby | 3 |
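`.iloc` works on positions, while `.loc` works on index labels. With the default index the two look alike, so this sketch gives `cats` a hypothetical index of `10, 20, 30` to make the difference visible. Note that a `.loc` label slice includes its end label.

```python
import pandas as pd

cats = pd.DataFrame(
    {"name": ["Midnight", "Sugarcane", "Moon"], "age": [2, 3, 1]},
    index=[10, 20, 30],  # hypothetical non-default labels
)

first_two_by_position = cats.iloc[0:2]  # positions 0 and 1, end excluded
first_two_by_label = cats.loc[10:20]    # labels 10 and 20, end included

# Negative positions count from the end, as with Python lists.
last_row = cats.iloc[-1]

print(first_two_by_position.index.tolist())
print(last_row["name"])
```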


Another option is selecting rows with a logical condition inside `.loc[]`. The syntax in this case is the following:

`df.loc[df.MyColumnName == desired_column_value]`

In [28]:
grades.loc[grades.Category == 'Asian']

| | Grade | Year | Category | Number Tested | Mean Scale Score | Level 1 # | Level 1 % | Level 2 # | Level 2 % | Level 3 # | Level 3 % | Level 4 # | Level 4 % | Level 3+4 # | Level 3+4 % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2006 | Asian | 9768 | 700 | 243 | 2.5 | 543 | 5.6 | 4128 | 42.3 | 4854 | 49.7 | 8982 | 92.0 |
| 1 | 4 | 2006 | Asian | 9973 | 699 | 294 | 2.9 | 600 | 6.0 | 4245 | 42.6 | 4834 | 48.5 | 9079 | 91.0 |
| 2 | 5 | 2006 | Asian | 9852 | 691 | 369 | 3.7 | 907 | 9.2 | 4379 | 44.4 | 4197 | 42.6 | 8576 | 87.0 |
| 3 | 6 | 2006 | Asian | 9606 | 682 | 452 | 4.7 | 1176 | 12.2 | 4646 | 48.4 | 3332 | 34.7 | 7978 | 83.1 |
| 4 | 7 | 2006 | Asian | 9433 | 671 | 521 | 5.5 | 1698 | 18.0 | 4690 | 49.7 | 2524 | 26.8 | 7214 | 76.5 |
| 5 | 8 | 2006 | Asian | 9593 | 675 | 671 | 7.0 | 1847 | 19.3 | 4403 | 45.9 | 2672 | 27.9 | 7075 | 73.8 |
| 6 | All Grades | 2006 | Asian | 58225 | 687 | 2550 | 4.4 | 6771 | 11.6 | 26491 | 45.5 | 22413 | 38.5 | 48904 | 84.0 |
| 7 | 3 | 2007 | Asian | 9750 | 706 | 156 | 1.6 | 402 | 4.1 | 3886 | 39.9 | 5306 | 54.4 | 9192 | 94.3 |
| 8 | 4 | 2007 | Asian | 9881 | 704 | 209 | 2.1 | 564 | 5.7 | 3968 | 40.2 | 5140 | 52.0 | 9108 | 92.2 |
| 9 | 5 | 2007 | Asian | 10111 | 700 | 211 | 2.1 | 626 | 6.2 | 4257 | 42.1 | 5017 | 49.6 | 9274 | 91.7 |
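The condition inside `.loc[]` is itself a boolean `Series` with one `True` or `False` per row, and it can be stored in a variable first. A short sketch on the `cats` table (rebuilt here; `is_adult` is a name chosen for illustration) also shows that `.loc[]` can take a column name after the row condition:

```python
import pandas as pd

cats = pd.DataFrame({
    "name": ["Midnight", "Sugarcane", "Moon"],
    "color": ["black", "tabby", "white"],
    "age": [2, 3, 1],
})

# One boolean per row: True where the cat is at least 2 years old.
is_adult = cats.age >= 2

adults = cats.loc[is_adult]               # keep the rows marked True
adult_names = cats.loc[is_adult, "name"]  # filter rows and pick a column

print(is_adult.tolist())
print(adult_names.tolist())
```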


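You can also combine multiple conditions with `|` (or) and `&` (and), wrapping each condition in parentheses. Note that `Grade` is stored as text (`object` dtype in the `.info()` output above), so `grades.Grade == 4` matches no rows and the output below reflects only `Year > 2006`; compare against the string `'4'` to match fourth-grade rows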

In [31]:
grades.loc[(grades.Grade == 4) | (grades.Year > 2006)]

| | Grade | Year | Category | Number Tested | Mean Scale Score | Level 1 # | Level 1 % | Level 2 # | Level 2 % | Level 3 # | Level 3 % | Level 4 # | Level 4 % | Level 3+4 # | Level 3+4 % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 3 | 2007 | Asian | 9750 | 706 | 156 | 1.6 | 402 | 4.1 | 3886 | 39.9 | 5306 | 54.4 | 9192 | 94.3 |
| 8 | 4 | 2007 | Asian | 9881 | 704 | 209 | 2.1 | 564 | 5.7 | 3968 | 40.2 | 5140 | 52.0 | 9108 | 92.2 |
| 9 | 5 | 2007 | Asian | 10111 | 700 | 211 | 2.1 | 626 | 6.2 | 4257 | 42.1 | 5017 | 49.6 | 9274 | 91.7 |
| 10 | 6 | 2007 | Asian | 9808 | 694 | 343 | 3.5 | 778 | 7.9 | 4356 | 44.4 | 4331 | 44.2 | 8687 | 88.6 |
| 11 | 7 | 2007 | Asian | 9779 | 685 | 333 | 3.4 | 1220 | 12.5 | 4255 | 43.5 | 3971 | 40.6 | 8226 | 84.1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 163 | 5 | 2011 | White | 10808 | 699 | 311 | 2.9 | 1709 | 15.8 | 4532 | 41.9 | 4256 | 39.4 | 8788 | 81.3 |
| 164 | 6 | 2011 | White | 9875 | 695 | 409 | 4.1 | 1818 | 18.4 | 3435 | 34.8 | 4213 | 42.7 | 7648 | 77.4 |
| 165 | 7 | 2011 | White | 9679 | 690 | 423 | 4.4 | 1739 | 18.0 | 3023 | 31.2 | 4494 | 46.4 | 7517 | 77.7 |
| 166 | 8 | 2011 | White | 9570 | 688 | 433 | 4.5 | 2190 | 22.9 | 4142 | 43.3 | 2805 | 29.3 | 6947 | 72.6 |
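A minimal sketch of the three element-wise operators, `|` (or), `&` (and) and `~` (not), on a made-up four-row frame. The `sample` data is hypothetical; its `Grade` column holds strings, which is an assumption modelled on the `object` dtype that `.info()` reports for the real column. Each condition needs its own parentheses because these operators bind more tightly than `==` and `>`.

```python
import pandas as pd

# Hypothetical sample with Grade stored as text.
sample = pd.DataFrame({
    "Grade": ["3", "4", "All Grades", "4"],
    "Year": [2006, 2006, 2006, 2007],
})

is_grade_4 = sample.Grade == "4"
after_2006 = sample.Year > 2006

either = sample.loc[is_grade_4 | after_2006]      # at least one condition holds
both = sample.loc[is_grade_4 & after_2006]        # both conditions hold
neither = sample.loc[~(is_grade_4 | after_2006)]  # no condition holds

# Comparing text values with the integer 4 matches nothing.
int_matches = int((sample.Grade == 4).sum())

print(either.index.tolist(), both.index.tolist(), neither.index.tolist())
print(int_matches)
```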


Another useful tool is the `.isin()` method, which takes a list as its argument and selects the rows whose value in that column matches any item in the list

In [32]:
grades.loc[grades.Category.isin(['Asian', 'Black'])]

| | Grade | Year | Category | Number Tested | Mean Scale Score | Level 1 # | Level 1 % | Level 2 # | Level 2 % | Level 3 # | Level 3 % | Level 4 # | Level 4 % | Level 3+4 # | Level 3+4 % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2006 | Asian | 9768 | 700 | 243 | 2.5 | 543 | 5.6 | 4128 | 42.3 | 4854 | 49.7 | 8982 | 92.0 |
| 1 | 4 | 2006 | Asian | 9973 | 699 | 294 | 2.9 | 600 | 6.0 | 4245 | 42.6 | 4834 | 48.5 | 9079 | 91.0 |
| 2 | 5 | 2006 | Asian | 9852 | 691 | 369 | 3.7 | 907 | 9.2 | 4379 | 44.4 | 4197 | 42.6 | 8576 | 87.0 |
| 3 | 6 | 2006 | Asian | 9606 | 682 | 452 | 4.7 | 1176 | 12.2 | 4646 | 48.4 | 3332 | 34.7 | 7978 | 83.1 |
| 4 | 7 | 2006 | Asian | 9433 | 671 | 521 | 5.5 | 1698 | 18.0 | 4690 | 49.7 | 2524 | 26.8 | 7214 | 76.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 79 | 5 | 2011 | Black | 19817 | 675 | 1786 | 9.0 | 7928 | 40.0 | 7876 | 39.7 | 2227 | 11.2 | 10103 | 51.0 |
| 80 | 6 | 2011 | Black | 20312 | 667 | 3122 | 15.4 | 8544 | 42.1 | 6335 | 31.2 | 2311 | 11.4 | 8646 | 42.6 |
| 81 | 7 | 2011 | Black | 21095 | 664 | 3163 | 15.0 | 9100 | 43.1 | 6189 | 29.3 | 2643 | 12.5 | 8832 | 41.9 |
| 82 | 8 | 2011 | Black | 21555 | 663 | 3419 | 15.9 | 9789 | 45.4 | 6838 | 31.7 | 1509 | 7.0 | 8347 | 38.7 |
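`.isin()` returns a boolean `Series`, so it can be negated with `~` to keep the rows that are not in the list. A small sketch on the `cats` table (rebuilt here; the variable names are chosen for illustration):

```python
import pandas as pd

cats = pd.DataFrame({
    "name": ["Midnight", "Sugarcane", "Moon"],
    "color": ["black", "tabby", "white"],
    "age": [2, 3, 1],
})

wanted_colors = ["black", "tabby"]

# Rows whose color is one of the listed values.
black_or_tabby = cats.loc[cats.color.isin(wanted_colors)]

# ~ flips the mask: rows whose color is NOT in the list.
other_colors = cats.loc[~cats.color.isin(wanted_colors)]

print(black_or_tabby.name.tolist())
print(other_colors.name.tolist())
```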


### Setting indices

When subsetting a `DataFrame`, the selected rows keep their original index labels. To renumber them starting from `0`, reset the indices using `.reset_index(drop=True, inplace=True)`

In [45]:
parents = family.loc[family.member.isin(['mom', 'dad'])]
parents.reset_index(drop=True, inplace=True)
print(parents)

    name member  age
0  María    mom   50
1   Juan    dad   51
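To see what resetting changes, this sketch prints the index labels before and after. It uses the assignment form `parents = parents.reset_index(drop=True)` as an alternative to `inplace=True`, which is a stylistic choice here rather than a requirement. Without `drop=True`, the old labels are kept in a new column named `index`.

```python
import pandas as pd

family = pd.DataFrame({
    "name": ["Carlota", "Gonzalo", "María", "Juan"],
    "member": ["sister", "brother", "mom", "dad"],
    "age": [22, 19, 50, 51],
})

parents = family.loc[family.member.isin(["mom", "dad"])]
labels_before = parents.index.tolist()  # labels carried over from family

# drop=True discards the old labels and renumbers from 0.
parents = parents.reset_index(drop=True)
labels_after = parents.index.tolist()

# Without drop=True the old labels survive as an 'index' column.
with_old_labels = family.loc[family.member.isin(["mom", "dad"])].reset_index()

print(labels_before, labels_after)
print(list(with_old_labels.columns))
```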
