# Hands-on: `pandas` & Data Cleaning

This hands-on will cover handling data using the `pandas` library.
- Reading in data
- Descriptive statistics
- Data wrangling
- Filtering
- Aggregation
- Merging

_This notebook was derived from the Introduction to Data Science notebooks by Unisse Chua, Jude Teves and Sashmir Yap of the Data Science Institute._

In [6]:
# All imports are placed at the top of the notebook
# so that every library needed to run it is loaded (and installed) up front.
import os

import pandas as pd
import numpy as np

from pathlib import Path
# -> We use pathlib's Path so that file paths are built the same way on every OS.
# -> Windows and macOS/Linux write paths differently (backslashes vs. forward slashes).
# Without Path, anyone receiving this notebook would have to keep editing the
# path names of the data files that it reads.
# -> This helps make the notebook reproducible: if I give you the notebook and you run it,
# it should run without problems and without modification.

### Reproducibility

With data science, or any project that requires data processing and analysis, it is important to ensure that the code provided can be replicated by other members of the team. One such way is by setting up an **environment variable** that represents a central location for the data files.

Steps: 
1. Create a folder in your computer to serve as the location for all the data files of your project.
2. Save the directory path as an environment variable. The name of the variable should be the same for all team members so that everyone would simply have to use this name across their code.

This folder will host **only DATA files**. Code may be kept elsewhere; it simply needs to reference this folder when reading and saving data files.
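
The two steps above can be sketched as a small helper. This is only an illustration: the function name `get_data_dir` and the fallback folder used in the demo are assumptions, not part of the original notebook. The point is that a missing variable should fail with a readable message instead of an obscure `TypeError`.

```python
import os
from pathlib import Path


def get_data_dir(var_name: str = "DSDATA") -> Path:
    """Return the shared data folder named by an environment variable."""
    value = os.environ.get(var_name)
    if value is None:
        # Path(None) would raise an unhelpful TypeError, so fail with a clear message.
        raise RuntimeError(
            f"Environment variable {var_name!r} is not set. "
            "Point it at your project's data folder and restart Jupyter."
        )
    return Path(value).expanduser()


# Demo only: give the variable a hypothetical value if it is not already set,
# so that this example runs on any machine.
os.environ.setdefault("DSDATA", str(Path.home() / "dsdata"))

data_dir = get_data_dir()
# The / operator joins path parts correctly on Windows, macOS, and Linux.
print(data_dir / "example.csv")
```

In a real project the variable would be set once in the operating system (or in the shell profile) rather than inside the notebook.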

In [7]:
DSDATA = Path(os.getenv('DSDATA'))
DSDATA

# -> For this to work, each of us must define the DSDATA environment variable on our own machine.
# The code then only refers to the variable, so nobody has to worry about where
# the data folder actually lives on someone else's computer.
# -> If DSDATA is not set, os.getenv returns None and Path(None) raises a TypeError.

WindowsPath('C:/Users/User/Desktop/Uni/3rdYear/2ndTerm/DATAPRE/notebooks/datapre-notebooks/data')

## Data

The Philippines has an Open Data portal: https://data.gov.ph

For this exercise, we'll be using the [Public Elementary School Enrollment Statistics](https://data.gov.ph/?q=dataset/public-elementary-school-enrollment-statistics) provided by the Department of Education. The page contains two files. Download both files and save them to the data folder that your `DSDATA` environment variable points to, since the code below reads them from there.

## Reading Data

"`pandas` is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language." - [Source](https://pandas.pydata.org/pandas-docs/stable/)

Commonly used file types for published *(semi-)* structured data are **CSV (or TSV)** and sometimes even **Excel** (although this is not an 'open' format). In `pandas`, it's straightforward to read in these types of files.
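
As a minimal sketch of the paragraph above, the snippet below reads the same small table once as CSV and once as TSV, and then reads a Latin-1 encoded byte stream. The in-memory "files" and the school name `Doña Remedios ES` are made up for illustration; only `pd.read_csv` and its `sep` and `encoding` arguments are real.

```python
import io

import pandas as pd

# Hypothetical in-memory files standing in for real CSV/TSV files on disk.
csv_text = (
    "school_id,school_name,enrollment\n"
    "100001,Apaleng-libtong ES,9\n"
    "100002,Bacarra CES,41\n"
)
tsv_text = csv_text.replace(",", "\t")

# read_csv assumes commas by default; pass sep="\t" for tab-separated values.
csv_df = pd.read_csv(io.StringIO(csv_text))
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# A byte stream encoded as Latin-1: "ñ" is the single byte 0xF1, which is not valid UTF-8,
# so the encoding has to be stated explicitly.
latin1_bytes = "school_name\nDoña Remedios ES\n".encode("latin1")
latin1_df = pd.read_csv(io.BytesIO(latin1_bytes), encoding="latin1")

print(csv_df)
print(latin1_df)
```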

In [8]:
pd.read_csv?

# -> TSV is Tab-separated values

In [9]:
pd.read_excel?

# -> Excel is a proprietary, paid (subscription) application, which is why its file
# format is NOT treated as an open format here. Excel files are designed to be opened in that application.

In [10]:
# by default, the encoding is utf-8, but since the data has some latin characters
# the encoding argument needs to be updated
deped2012 = pd.read_csv(DSDATA / 'deped_publicelementaryenrollment2012.csv', encoding='latin1')

# -> If you type a path as a plain string on Windows, replace each backslash (\) with
# a forward slash (/).
# -> This is what we mean when we say that it's not reproducible: a path hard-coded
# for one machine produces an error when the notebook is run on another device.

# the head function provides a preview of the first 5 rows of the data
deped2012.head()

Unnamed: 0,school_id,school_name,region,province,municipality,division,district,year_level,gender,enrollment
0,101746,"A. Diaz, Sr. ES",I (Ilocos Region),Pangasinan,Bautista,"Pangasinan II, Binalonan",Bautista,grade 1,male,53
1,102193,A. P. Santos ES (SPED Center),I (Ilocos Region),Ilocos Norte,Laoag City (Capital),Laoag City,Laoag City District II,grade 1,male,31
2,101283,A.P. Guevarra IS,I (Ilocos Region),Pangasinan,Bayambang,"Pangasinan I, Lingayen",Bayambang II,grade 1,male,16
3,100216,Ab-Abut ES,I (Ilocos Region),Ilocos Norte,Piddig,Ilocos Norte,Piddig,grade 1,male,19
4,100043,Abaca ES,I (Ilocos Region),Ilocos Norte,Bangui,Ilocos Norte,Bangui,grade 1,male,12


In [11]:
# Let's read in the other file too
deped2015 = pd.read_csv(DSDATA / 'depend_publicelementaryenrollment2015.csv', encoding='latin1')
deped2015.head()

Unnamed: 0,region,province,municipality,division,school_id,school_name,year_level,gender,enrollment,latitude,longitude
0,Region I - Ilocos Region,Ilocos Norte,Bacarra,Ilocos Norte,100001,Apaleng-libtong ES,grade 1,male,9,18.253666,120.60618
1,Region I - Ilocos Region,Ilocos Norte,Bacarra,Ilocos Norte,100002,Bacarra CES,grade 1,male,41,18.25096389,120.6089583
2,Region I - Ilocos Region,Ilocos Norte,Bacarra,Ilocos Norte,100003,Buyon ES,grade 1,male,7,18.234599,120.616037
3,Region I - Ilocos Region,Ilocos Norte,Bacarra,Ilocos Norte,100004,Ganagan Elementary School,grade 1,male,8,18.25001389,120.5871694
4,Region I - Ilocos Region,Ilocos Norte,Bacarra,Ilocos Norte,100005,Macupit ES,grade 1,male,5,18.29399444,120.6410194


### Let's begin exploring the data...

These are the basic checks that we should run every time we load a file. They let us confirm that the file was loaded properly (e.g., that the number of rows is what we expect).
* How many rows and columns do we have? 
* What is the data type of each column? (This tells us what kind of processing each column needs.)
* What is the most common value? Mean? Standard deviation?

#### `shape`

A `pandas` `DataFrame` behaves much like a 2D `numpy` array with labeled rows and columns. Using the `shape` attribute of the `DataFrame`, we can easily check the dimensions of the data file we read. It returns a tuple of the form (rows, columns).

In [12]:
deped2012.shape

# -> shape is not a function, but rather an attribute
# -> would allow us to know the number of columns and rows
# 463908 rows, 10 columns

(463908, 10)

In [13]:
deped2015.shape

# -> Even though both files come from the same source (DepEd), they differ
# in the number of columns.
# -> The order of their columns is also different.
# -> This means that we cannot simply concatenate the files as they are; they are two different tables.

(396288, 11)

#### `dtypes` 
`dtypes` lets you check what data type each column is.

In [14]:
deped2012.dtypes

# -> object usually means that the column holds strings

school_id        int64
school_name     object
region          object
province        object
municipality    object
division        object
district        object
year_level      object
gender          object
enrollment       int64
dtype: object

In [15]:
deped2015.dtypes

# -> latitude and longitude were read in as strings (object). That might be a problem
# because they should be numeric. We need to investigate why this happened;
# a common cause is a non-numeric entry somewhere in the column.

region          object
province        object
municipality    object
division        object
school_id        int64
school_name     object
year_level      object
gender          object
enrollment       int64
latitude        object
longitude       object
dtype: object
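
One way to investigate a numeric column that was read in as `object` is to attempt the conversion and look at the rows that fail. The sketch below uses made-up values; the actual bad entries in the 2015 file are not shown in this notebook, so `"N/A"` and `"unknown"` are assumptions.

```python
import pandas as pd

# Hypothetical values; the real column may contain different kinds of bad entries.
coords = pd.DataFrame({"latitude": ["18.253666", "18.25096389", "N/A", "unknown"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
# instead of raising an error.
as_number = pd.to_numeric(coords["latitude"], errors="coerce")

# Rows that were present but failed to convert are the ones worth inspecting by hand.
bad_rows = coords[as_number.isna() & coords["latitude"].notna()]
print(bad_rows)

# Keep the converted values in a new column so the original text is not lost.
coords["latitude_num"] = as_number
print(coords.dtypes)
```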

#### `describe()`
`describe()` provides the basic descriptive statistics of the `DataFrame`. By default, it only includes the columns with numerical data. Non-numerical columns are omitted, but the `include` argument can be used to show statistics for non-numerical data.

In [16]:
deped2012.describe()

# -> Does it make sense to get the mean and standard deviation of the
# school ID column? Not really: it is just an identifier, not a quantity.
# -> The descriptive statistics matter for the enrollment count (e.g.,
# the minimum value is 0, which means that at least one row records
# no enrollees for a specific school).

Unnamed: 0,school_id,enrollment
count,463908.0,463908.0
mean,123102.43982,28.582152
std,22010.932343,44.727529
min,100001.0,0.0
25%,109747.0,9.0
50%,119533.0,16.0
75%,129325.0,31.0
max,261503.0,1047.0


In [17]:
deped2012.describe(include=object)
# -> originally: deped2012.describe(include=np.object), but the np.object
# alias is deprecated (and removed in newer NumPy versions), so it shows a warning
# -> Usually, for objects, we do not use describe as there are other
# functions that summarize them. However, you can include
# object columns by using include.
# -> top -> the most common value in the column
# -> freq -> the number of times that the top value appears in the
# column

Unnamed: 0,school_name,region,province,municipality,division,district,year_level,gender
count,463908,463908,463908,463908,463908,463908,463908,463908
unique,29699,17,86,1437,206,2415,6,2
top,San Isidro ES,VIII (Eastern Visayas),Leyte,Davao City,Leyte,Rizal,grade 1,female
freq,2328,43728,15612,3420,14136,1608,77318,231954


### Data Wrangling

After looking at the basic information about the data, let's see how "clean" the data is.

#### Common Data Problems (from slides)
1. Missing values
2. Formatting issues / data types
3. Duplicate records
4. Varying representation / Handle categorical values

#### `isna()` / `isnull()`

To check whether there are any missing values, `pandas` provides these two functions to detect them. Each one maps every individual cell to either True or False.

`isnull()` is an alias of `isna()`, so the two return the same values and it doesn't matter which one we use.

#### `dropna()`

To remove any records with missing values, `dropna()` may be used. It has a number of arguments to help narrow down the criteria for removing the records with missing values.

In [19]:
deped2012.isna().sum()

# -> If all counts are 0, the data has no null/NA values.
# -> Without sum(), the result would be a table of
# True/False values, one per cell.

school_id       0
school_name     0
region          0
province        0
municipality    0
division        0
district        0
year_level      0
gender          0
enrollment      0
dtype: int64

In [22]:
deped2012.dropna?
# -> Used to drop rows (the default) or columns that contain missing values.
# -> By default, it drops rows that contain missing values.
# -> how -> a parameter that determines which rows are dropped:
# if how is 'any', then any row with at least one missing value is
# dropped. If how is 'all', then a row is only dropped if
# all values of the row are missing.
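
A small sketch of how these checks behave on data that does have gaps (the enrollment file above has none). The table below is a toy example made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps.
df = pd.DataFrame(
    {
        "school_id": [1, 2, 3, 4],
        "province": ["Leyte", None, "Cebu", None],
        "enrollment": [30.0, 12.0, np.nan, np.nan],
    }
)

# Number of missing cells per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# how="any": drop a row if at least one of its values is missing.
drop_any = df.dropna(how="any")

# how="all": drop a row only if every value in it is missing.
drop_all = df.dropna(how="all")

# subset: only look at the listed columns when deciding what to drop.
drop_subset = df.dropna(subset=["enrollment"])

print(len(drop_any), len(drop_all), len(drop_subset))
```

Here `how="all"` removes nothing because `school_id` is always present, which is why `subset` is often the more useful argument.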

#### `duplicated()` --> `drop_duplicates()`

The `duplicated()` function returns a boolean `Series` that marks which rows of the `DataFrame` are duplicates. Its `subset` argument lets you specify which columns to compare. 

`drop_duplicates()` is the function that removes the rows flagged by `duplicated()`.

In [23]:
deped2012.duplicated().sum()

0

In [28]:
deped2012.columns

# -> lists the column names

Index(['school_id', 'school_name', 'region', 'province', 'municipality',
       'division', 'district', 'year_level', 'gender', 'enrollment'],
      dtype='object')

In [29]:
deped2012.duplicated(subset=['school_id', 'school_name', 'region', 'province', 'municipality',
       'division', 'district', 'year_level']).sum()

# -> this checks for duplicates using only the columns that we specify
# -> together with gender, these combinations should be unique

231954

In [30]:
deped2012[deped2012.duplicated(subset=['school_id', 'school_name', 'region', 'province', 'municipality','division', 'district', 'year_level'])]

# -> NOTE: gender was left out of the subset on purpose, to demonstrate
# duplicated()
# -> this shows the rows that were flagged as duplicates
# -> In this example, we only see the female rows because the dataset
# lists the male rows first, so the function treats the later
# female rows as the duplicates.
# -> This comes from the default value of the "keep" parameter ('first'),
# which treats the first occurrence as the original and flags the rest.
# -> keep=False flags every occurrence, so here it would return the entire
# table because every row has a counterpart.

Unnamed: 0,school_id,school_name,region,province,municipality,division,district,year_level,gender,enrollment
38659,101746,"A. Diaz, Sr. ES",I (Ilocos Region),Pangasinan,Bautista,"Pangasinan II, Binalonan",Bautista,grade 1,female,64
38660,102193,A. P. Santos ES (SPED Center),I (Ilocos Region),Ilocos Norte,Laoag City (Capital),Laoag City,Laoag City District II,grade 1,female,38
38661,101283,A.P. Guevarra IS,I (Ilocos Region),Pangasinan,Bayambang,"Pangasinan I, Lingayen",Bayambang II,grade 1,female,16
38662,100216,Ab-Abut ES,I (Ilocos Region),Ilocos Norte,Piddig,Ilocos Norte,Piddig,grade 1,female,14
38663,100043,Abaca ES,I (Ilocos Region),Ilocos Norte,Bangui,Ilocos Norte,Bangui,grade 1,female,22
...,...,...,...,...,...,...,...,...,...,...
463903,136870,Wawa Elementary School,NCR (National Capital Region),NCR Third District,City of Navotas,Navotas,Navotas District II,grade 6,female,59
463904,136811,WawangPulo ES,NCR (National Capital Region),NCR Third District,City of Valenzuela,Valenzuela City,Valenzuela City North District,grade 6,female,41
463905,222501,West Fairview Elementary School,NCR (National Capital Region),NCR Second District,Quezon City,Quezon City,School District XI,grade 6,female,143
463906,136698,West Rembo ES,NCR (National Capital Region),NCR Fourth District,City of Makati,Makati City,Makati City District I,grade 6,female,117


In [27]:
deped2012.drop_duplicates?

# -> You can pass the same arguments that you used in duplicated()
# to drop_duplicates(). It drops the rows that duplicated()
# flags.
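
The effect of the `keep` parameter is easier to see on a tiny table. The four rows below are a made-up miniature of the enrollment data, with one male and one female row per school.

```python
import pandas as pd

# Hypothetical miniature of the enrollment table: male rows first, then female rows.
df = pd.DataFrame(
    {
        "school_id": [100001, 100002, 100001, 100002],
        "year_level": ["grade 1", "grade 1", "grade 1", "grade 1"],
        "gender": ["male", "male", "female", "female"],
        "enrollment": [9, 41, 12, 38],
    }
)
key = ["school_id", "year_level"]  # gender deliberately left out

# keep="first" (the default): the first occurrence is the original, later ones are flagged.
flag_first = df.duplicated(subset=key)

# keep="last": the last occurrence is the original, earlier ones are flagged.
flag_last = df.duplicated(subset=key, keep="last")

# keep=False: every row that has a counterpart is flagged.
flag_all = df.duplicated(subset=key, keep=False)

# drop_duplicates accepts the same arguments and removes the flagged rows.
deduped = df.drop_duplicates(subset=key)

print(flag_first.tolist(), flag_last.tolist(), flag_all.tolist())
print(deduped)
```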

#### Varying representation

For categorical or textual data, unless the options provided are fixed, misspellings and different representations may exist in the same file.

To check the unique values of each column, a `pandas` `Series` has a function `unique()` which returns all the unique values of the column.

In [31]:
deped2012['province'].unique()

array(['Pangasinan', 'Ilocos Norte', 'Ilocos Sur', 'La Union', 'Isabela',
       'Nueva Vizcaya', 'Cagayan', 'Quirino', 'Batanes', 'Nueva Ecija',
       'Bataan', 'Bulacan', 'Aurora', 'Zambales', 'Tarlac', 'Pampanga',
       'Batangas', 'Quezon', 'Laguna', 'Rizal', 'Cavite', 'Palawan',
       'Oriental Mindoro', 'Occidental Mindoro', 'Romblon', 'Marinduque',
       'Camarines Sur', 'Camarines Norte', 'Masbate', 'Sorsogon', 'Albay',
       'Catanduanes', 'Negros Occidental', 'Iloilo', 'Antique', 'Capiz',
       'Aklan', 'Guimaras', 'Bohol', 'Negros Oriental', 'Cebu',
       'Siquijor', 'Eastern Samar', 'Leyte', 'Western Samar',
       'Northern Samar', 'Southern Leyte', 'Biliran', 'Zamboanga del Sur',
       'Zamboanga Sibugay', 'Zamboanga del Norte', 'City of Isabela',
       'Misamis Occidental', 'Lanao del Norte', 'Misamis Oriental',
       'Bukidnon', 'Camiguin', 'Davao del Sur', 'Davao del Norte',
       'Davao Oriental', 'Compostela Valley', 'South Cotabato',
       'Sarangani', '

In [32]:
deped2012['year_level'].unique()

array(['grade 1', 'grade 2', 'grade 3', 'grade 4', 'grade 5', 'grade 6'],
      dtype=object)

In [33]:
deped2012['region'].unique()

array(['I (Ilocos Region)', 'II (Cagayan Valley)', 'III (Central Luzon)',
       'IV-A (CALABARZON)', 'IV-B (MIMAROPA)', 'V (Bicol Region)',
       'VI (Western Visayas)', 'VII (Central Visayas)',
       'VIII (Eastern Visayas)', 'IX (Zamboanga Peninsula)',
       'X (Northern Mindanao)', 'XI (Davao Region)', 'XII (SOCCSKSARGEN)',
       'XIII (Caraga)', 'ARMM (Autonomous Region in Muslim Mindanao)',
       'CAR (Cordillera Administrative Region)',
       'NCR (National Capital Region)'], dtype=object)

In [34]:
deped2015['region'].unique()

array(['Region I - Ilocos Region', 'Region II - Cagayan Valley',
       'Region III - Central Luzon', 'Region IV-A - CALABARZON',
       'Region IV-B - MIMAROPA', 'Region V - Bicol Region',
       'Region VI - Western Visayas', 'Region VII - Central Visayas',
       'Region VIII - Eastern Visayas', 'Region IX - Zamboanga Peninsula',
       'Region X - Northern Mindanao', 'Region XI - Davao Region',
       'Region XII - SOCCSKSARGEN', 'CARAGA - CARAGA',
       'ARMM - Autonomous Region in Muslim Mindanao'], dtype=object)

In [35]:
# -> Comparing the regions in the deped2012 and deped2015 datasets, we can
# see that we cannot match the regions directly because they are
# written differently.
# -> unique() can be used to check whether two datasets use the same values
# in a column, so that we can aggregate them correctly when summarizing data.
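
One possible way to reconcile the two region formats is to reduce both to the region name in upper case. This is only a sketch under the assumption that the 2012 values always look like `CODE (Name)` and the 2015 values like `CODE - Name`, as in the outputs above; the helper names `normalize_2012` and `normalize_2015` are made up.

```python
import pandas as pd

# A few values copied from the two unique() outputs above.
regions_2012 = pd.Series(
    ["I (Ilocos Region)", "IV-A (CALABARZON)", "XIII (Caraga)", "NCR (National Capital Region)"]
)
regions_2015 = pd.Series(
    ["Region I - Ilocos Region", "Region IV-A - CALABARZON", "CARAGA - CARAGA"]
)


def normalize_2012(regions: pd.Series) -> pd.Series:
    """Keep the text inside the parentheses, upper-cased."""
    return regions.str.extract(r"\((.+)\)", expand=False).str.strip().str.upper()


def normalize_2015(regions: pd.Series) -> pd.Series:
    """Keep the text after the first ' - ' separator, upper-cased."""
    return regions.str.split(" - ", n=1).str[-1].str.strip().str.upper()


print(normalize_2012(regions_2012).tolist())
print(normalize_2015(regions_2015).tolist())
```

After this step the same region has the same label in both years, so the column can serve as a grouping or merge key. A hand-written mapping dictionary would be a reasonable alternative when the formats are less regular.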

### Summarizing Data

High data granularity is great for a detailed analysis. However, data is usually summarized or aggregated prior to visualization. `pandas` also provides an easy way to summarize data based on the columns you'd like using the `groupby` function.

We can call any of the following when grouping by columns:
- count()
- sum()
- min()
- max()
- std()

For columns that are categorical in nature, we can simply do `df['column'].value_counts()`. This will give the frequency of each unique value in the column. 
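
A minimal sketch of these aggregations on a made-up table, so that the numbers are easy to verify by hand. The column names mirror the enrollment data, but the values and region labels are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {
        "region": ["A", "A", "A", "B", "B"],
        "gender": ["female", "male", "male", "female", "male"],
        "enrollment": [10, 20, 5, 7, 3],
    }
)

# Total enrollment per (region, gender) pair, returned as a flat table.
totals = toy.groupby(["region", "gender"], as_index=False)["enrollment"].sum()
print(totals)

# Several aggregations at once, each with its own output column name.
summary = toy.groupby("region", as_index=False).agg(
    total=("enrollment", "sum"),
    smallest=("enrollment", "min"),
    largest=("enrollment", "max"),
    rows=("enrollment", "count"),
)
print(summary)

# value_counts() counts rows per category; it does not add up enrollment.
row_counts = toy["region"].value_counts()
print(row_counts)
```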

In [36]:
# pd.Series.value_counts()

deped2015['region'].value_counts()

# -> these values do not represent the enrollment counts; they are the
# number of rows that have each value in the region column

Region VIII - Eastern Visayas                  41484
Region VI - Western Visayas                    39204
Region V - Bicol Region                        35892
Region VII - Central Visayas                   33768
Region III - Central Luzon                     33336
Region IV-A - CALABARZON                       31452
Region I - Ilocos Region                       27648
Region II - Cagayan Valley                     24924
Region IX - Zamboanga Peninsula                23724
Region X - Northern Mindanao                   23532
Region IV-B - MIMAROPA                         20232
Region XI - Davao Region                       18420
CARAGA - CARAGA                                18408
Region XII - SOCCSKSARGEN                      18228
ARMM - Autonomous Region in Muslim Mindanao     6036
Name: region, dtype: int64

In [38]:
deped2012['region'].value_counts()

VIII (Eastern Visayas)                         43728
VI (Western Visayas)                           40788
V (Bicol Region)                               37704
III (Central Luzon)                            35796
VII (Central Visayas)                          35196
IV-A (CALABARZON)                              32724
I (Ilocos Region)                              28728
ARMM (Autonomous Region in Muslim Mindanao)    26376
II (Cagayan Valley)                            26304
IX (Zamboanga Peninsula)                       25080
X (Northern Mindanao)                          25032
IV-B (MIMAROPA)                                22020
XII (SOCCSKSARGEN)                             20484
XIII (Caraga)                                  19968
XI (Davao Region)                              19584
CAR (Cordillera Administrative Region)         18180
NCR (National Capital Region)                   6216
Name: region, dtype: int64

In [None]:
# 1st problem: Missing regions
# --> NCR and CAR appear in the 2012 data but not in the 2015 data
# --> Problem in data cleanliness and data completeness
# --> Solution: (1) drop the NCR and CAR data from 2012 and analyze them 
# separately, or (2) merge them normally (even if there's no data for 2015)

# 2nd problem: the region is represented differently
# --> Fix the data so that they can be merged together
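
The first problem can also be detected programmatically by comparing the sets of values in each year. The sketch below assumes that the region labels have already been brought to a common spelling; the short lists are made-up stand-ins for the full columns.

```python
import pandas as pd

# Hypothetical, already-normalized region names for each year.
names_2012 = pd.Series(
    ["ILOCOS REGION", "CARAGA", "NATIONAL CAPITAL REGION", "CORDILLERA ADMINISTRATIVE REGION"]
)
names_2015 = pd.Series(["ILOCOS REGION", "CARAGA"])

# Set difference: values present in one year but absent from the other.
only_2012 = sorted(set(names_2012.unique()) - set(names_2015.unique()))
only_2015 = sorted(set(names_2015.unique()) - set(names_2012.unique()))

print("Only in 2012:", only_2012)
print("Only in 2015:", only_2015)
```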

In [39]:
deped2012.groupby?

# NOTE: If your notebook runs code that results in a
# crash, click the Kernel tab, and then Shut Down All Kernels.
# Afterwards, Reconnect to Kernel.

#### Exercise! 

Let's try to get the following:
1. Total number of enrolled students per region and gender
2. Total number of enrolled students per year level and gender

In [45]:
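deped2012.groupby(['region', 'gender'], as_index=False).sum(numeric_only=True)

# -> numeric_only=True limits the sum to the numeric columns; newer pandas
# versions may otherwise also try to "sum" the text columns.
# deped2012.groupby(['region', 'gender'], as_index=False)['enrollment'].sum()
# -> selects just the enrollment column to be shown with the groupby
# condition (region and gender)
# -> NOTE: Retain only the needed columns so that the result
# is clear (the sum of school_id below is meaningless).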

Unnamed: 0,region,gender,school_id,enrollment
0,ARMM (Autonomous Region in Muslim Mindanao),female,1981060044,325728
1,ARMM (Autonomous Region in Muslim Mindanao),male,1981060044,299438
2,CAR (Cordillera Administrative Region),female,1301046546,102069
3,CAR (Cordillera Administrative Region),male,1301046546,113411
4,I (Ilocos Region),female,1481294424,298996
5,I (Ilocos Region),male,1481294424,330234
6,II (Cagayan Valley),female,1396560900,207741
7,II (Cagayan Valley),male,1396560900,226833
8,III (Central Luzon),female,1965169458,629770
9,III (Central Luzon),male,1965169458,685337


In [47]:
deped2012.groupby(['year_level', 'gender']).sum(numeric_only=True)

# -> without as_index=False, the group keys become a MultiIndex (see below).
# Using as_index=False is recommended because the resulting flat table is
# clearer.

Unnamed: 0_level_0,Unnamed: 1_level_0,school_id,enrollment
year_level,gender,Unnamed: 2_level_1,Unnamed: 3_level_1
grade 1,female,4759017221,1262015
grade 1,male,4759017221,1500512
grade 2,female,4759017221,1144869
grade 2,male,4759017221,1281188
grade 3,female,4759017221,1052206
grade 3,male,4759017221,1138870
grade 4,female,4759017221,1003971
grade 4,male,4759017221,1059492
grade 5,female,4759017221,966741
grade 5,male,4759017221,998887


In [49]:
deped2012_regions = deped2012.groupby(['region', 'gender'], as_index=False)['enrollment'].sum()
deped2012_regions

Unnamed: 0,region,gender,enrollment
0,ARMM (Autonomous Region in Muslim Mindanao),female,325728
1,ARMM (Autonomous Region in Muslim Mindanao),male,299438
2,CAR (Cordillera Administrative Region),female,102069
3,CAR (Cordillera Administrative Region),male,113411
4,I (Ilocos Region),female,298996
5,I (Ilocos Region),male,330234
6,II (Cagayan Valley),female,207741
7,II (Cagayan Valley),male,226833
8,III (Central Luzon),female,629770
9,III (Central Luzon),male,685337


In [50]:
deped2015_regions = deped2015.groupby(['region', 'gender'], as_index=False)['enrollment'].sum()
deped2015_regions

Unnamed: 0,region,gender,enrollment
0,ARMM - Autonomous Region in Muslim Mindanao,female,68125
1,ARMM - Autonomous Region in Muslim Mindanao,male,63189
2,CARAGA - CARAGA,female,187179
3,CARAGA - CARAGA,male,205861
4,Region I - Ilocos Region,female,290392
5,Region I - Ilocos Region,male,322592
6,Region II - Cagayan Valley,female,207723
7,Region II - Cagayan Valley,male,227392
8,Region III - Central Luzon,female,601789
9,Region III - Central Luzon,male,657231


In [51]:
# -> It's easier to compare the two, but we still cannot merge them because
# of the varying values for region.

### Merging Data

Data are sometimes split across different files, or additional data from another source can be associated with a dataset. `pandas` provides ways to combine different `DataFrames` (provided that there are common variables on which to connect them).

#### `pd.merge`
`merge()` is very similar to database-style joins. `pandas` allows merging of `DataFrame` and **named** `Series` objects together. A join can be done along columns or indexes.

#### `pd.concat`
`concat()` on the other hand combines `pandas` objects along a specific axis.

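#### `df.append`
`append()` adds the rows of another `DataFrame` or `Series` to the end of the caller. Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions, use `pd.concat` instead. 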
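
A short sketch of the difference between joining on a key and stacking rows, using made-up tables. The row-wise `pd.concat` call at the end covers what `append()` does.

```python
import pandas as pd

# Hypothetical toy tables.
left = pd.DataFrame({"school_id": [1, 2], "enrollment": [72, 365]})
right = pd.DataFrame({"school_id": [2, 3], "latitude": [18.25, 18.23]})

# merge: database-style join on a shared key; only school 2 is in both tables.
joined = pd.merge(left, right, on="school_id", how="inner")
print(joined)

# concat: stack tables along an axis (rows by default).
more_rows = pd.DataFrame({"school_id": [3], "enrollment": [138]})
stacked = pd.concat([left, more_rows], ignore_index=True)
print(stacked)
```

`ignore_index=True` renumbers the rows of the result; without it, the original index labels of each piece are kept.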

In [53]:
pd.merge?

# -> the documentation shows examples on how to merge DataFrames together

In [54]:
pd.concat?

In [55]:
deped2012.append?

# -> only works on pandas versions before 2.0, where DataFrame.append still exists

#### Exercise

The task is to compare the enrollment statistics of the elementary schools between 2012 and 2015. 

1. Get the total number of enrolled students per school for each year
2. Merge the two `DataFrame`s together to show the summarized statistics for the two years for all schools.

In [59]:
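stats2012 = deped2012.groupby('school_id', as_index=False)['enrollment'].sum()
stats2015 = deped2015.groupby('school_id', as_index=False)['enrollment'].sum()

# -> school_id is the grouping key
# -> this totals the enrollment per school, regardless of region, year level,
# or gender, because school_id uniquely identifies a school
# -> selecting ['enrollment'] keeps the text columns out of the sum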

In [60]:
stats2012.head()

Unnamed: 0,school_id,enrollment
0,100001,72
1,100002,365
2,100003,138
3,100004,89
4,100005,73


In [61]:
stats2012.shape

(38659, 2)

In [62]:
stats2015.head()

Unnamed: 0,school_id,enrollment
0,100001,72
1,100002,407
2,100003,152
3,100004,107
4,100005,67


In [63]:
stats2015.shape

(33024, 2)

In [64]:
# -> The two datasets have a different number of rows, at least partly because
# the 2015 data has no NCR and CAR information.

#### Observations

1. Is the number of rows for both `DataFrames` the same or different? What's the implication if they're different?
2. Note the same column names for the two `DataFrames`. Based on the documentation for `merge()`, there's a parameter for suffixes for overlapping column names. If we want to avoid the "messy" suffixes, we can choose to rename columns prior to merging.

One way is to assign a list containing the names of ALL columns to the `columns` attribute.

```ipython
stats2012.columns = ['school_id', '2012']
stats2015.columns = ['school_id', '2015']
```

But this does not scale well if you have many columns. `pandas` has a function `rename()` to which we can pass a dictionary that maps old column names to new ones. With `inplace=True`, the `DataFrame` is modified directly instead of a renamed copy being returned.

```ipython
stats2012.rename(columns={'enrollment': '2012'}, inplace=True)
stats2015.rename(columns={'enrollment': '2015'}, inplace=True)
```
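
To see what the "messy" suffixes look like and how renaming avoids them, here is a sketch on two made-up two-row tables. The variable names `stats_a` and `stats_b` are hypothetical stand-ins for `stats2012` and `stats2015`.

```python
import pandas as pd

# Hypothetical stand-ins for the per-school totals of each year.
stats_a = pd.DataFrame({"school_id": [100001, 100002], "enrollment": [72, 365]})
stats_b = pd.DataFrame({"school_id": [100001, 100002], "enrollment": [72, 407]})

# Option 1: let merge() disambiguate the overlapping column with suffixes.
with_suffixes = stats_a.merge(stats_b, on="school_id", suffixes=("_2012", "_2015"))
print(with_suffixes.columns.tolist())

# Option 2: rename before merging. Without inplace=True, rename() returns a
# new DataFrame and leaves the original untouched.
renamed_a = stats_a.rename(columns={"enrollment": "2012"})
renamed_b = stats_b.rename(columns={"enrollment": "2015"})
tidy = renamed_a.merge(renamed_b, on="school_id")
print(tidy.columns.tolist())
```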

In [67]:
# try the code above
stats2012.columns = ['school_id', '2012']
stats2015.columns = ['school_id', '2015']

In [68]:
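## Merge the two dataframes using different "how" parameters
# how : {'left', 'right', 'outer', 'inner'}, default 'inner'
# NOTE: merge() returns a new DataFrame and leaves stats2012 unchanged,
# so assign the result to a variable if you want to keep it.
merged = stats2012.merge(stats2015, on='school_id', how='inner')
stats2012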

Unnamed: 0,school_id,2012
0,100001,72
1,100002,365
2,100003,138
3,100004,89
4,100005,73
...,...,...
38654,260501,54
38655,261001,697
38656,261501,25
38657,261502,12
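
To compare the `how` options, and to find schools that appear in only one of the two years, something like the following could be used. The tables are made-up miniatures of `stats2012` and `stats2015`; `indicator=True` is a real `merge()` option that adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Hypothetical miniatures of the per-school totals.
y2012 = pd.DataFrame({"school_id": [1, 2, 3], "2012": [72, 365, 138]})
y2015 = pd.DataFrame({"school_id": [2, 3, 4], "2015": [407, 152, 67]})

# Number of rows produced by each join type.
row_counts = {
    how: len(y2012.merge(y2015, on="school_id", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)

# An outer join keeps every school; _merge says which table each row came from.
outer = y2012.merge(y2015, on="school_id", how="outer", indicator=True)
only_one_year = outer[outer["_merge"] != "both"]
print(only_one_year)
```

Schools present in only one year get `NaN` in the other year's column, which is the practical implication of the differing row counts noted in the observations.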
