# Week 2 - Data Prep 1
After this week's lesson you should be able to:
- Check a column's data type and convert it to another type
- Rename a dataframe column 
- Handle missing data: 
    - Filter out missing data
    - Replace values 

This week's lessons are adapted from:
- [PPD599: Advanced Urban Analytics](https://github.com/gboeing/ppd599/tree/main/syllabus)
- [Geo-Python Lesson 5](https://geo-python-site.readthedocs.io/en/latest/notebooks/L5/processing-data-with-pandas.html)

In [50]:
# We are going to start importing the libraries we need
# all in one cell. 
# It is a good practice to keep all the imports in one cell so that
# we can easily see what libraries we are using in the notebook.

import pandas as pd

# If you don't have numpy installed, run
# !pip install numpy
# in a code cell first
import numpy as np

# 1. Data cleaning
As you might have already seen, when we work with data, the initial dataset is not always in a shape where we can use it as is. 

Sometimes column names are misspelled or unclear, values are missing, or a column is stored in the wrong format. Moreover, you may have noticed that we can often extract information from columns to make them easier to work with. All of these steps are part of a data cleaning (or data wrangling) process that gets the dataset ready for analysis.




## 1.1 Getting the data
Let's say we want to compare the relationship between: 
1. the **total number of students in a general ed public school** 
2. the **money spent on new school construction and improvements in that school**. 

### School Construction Authority

First, make sure you have `Active_Projects_Under_Construction.csv` in the same folder as this notebook. It's from [Active Projects Under Construction](https://data.cityofnewyork.us/Housing-Development/Active-Projects-Under-Construction/8586-3zfm) on NYC's open data portal, but I've modified it a little.

This is a dataset of new school projects (Capacity) and Capital Improvement Projects (CIP) currently under Construction, created by the School Construction Authority. 




In [51]:
## Here we read the CSV file saved in the same folder as this notebook
## using the read_csv() function from the pandas library

projects_under_const = pd.read_csv('Active_Projects_Under_Construction.csv')

Also, go ahead and download the data dictionary `SCA Active Projects in Construction Data Dictionary.xlsx`. Data dictionaries often have explanations for what each column name represents and other useful information about the data. 


If you open up the data dictionary, does it correspond to the "Columns in this Dataset" section in the NYC OpenData's page on this dataset? No, right? We have to be careful about these inconsistencies, even in official portals.

Taking a look at the first five rows we can already see there is a lot of missing data in this dataset. 

In [52]:
projects_under_const.head()

Unnamed: 0,School Name,BoroughCode,Geographical District,Project Description,Construction Award,Project type,Building ID,Building Address,City,Postcode,...,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,Location 1,Data As Of,data_year
0,,M,2,,0.0,CAP,M777,227 WEST 27TH STREET,Manhattan,,...,,,,,,,,,,
1,BAYSIDE HIGH SCHOOL - QUEENS,Q,26,FY19 RESO A AUDITORIUM UPGRADE,1261000.0,CIP,Q405,32-24 CORPORAL KENNEDY STREET,Queens,10301.0,...,,,,,,,,,1/6/22,2022.0
2,P.S. @ PARCEL F - QUEENS,Q,30,Demo,0.0,CAP,,2ND STREET BETWEEN 56TH AND 57TH AVENUE,Queens,11101.0,...,,,,,,,,,10/30/18,2018.0
3,3K CENTER @ 3893 DYRE AVENUE - BRONX,X,11,Lease,6262000.0,CAP,X501,3893 DYRE AVEUNE,Bronx,,...,,,,,,,,,8/4/22,2022.0
4,P.S. 129 - QUEENS,Q,25,Addition,0.0,CAP,Q129,128-02 7TH AVENUE,Queens,11356.0,...,-73.839771,7.0,19.0,945.0,4096774.0,4039760000.0,Whitestone,"(40.790638, -73.839771)",2/6/19,2019.0


### Class size dataset
Also download the `2021_-_2022_Average_Class_Size_by_School.csv` [2021 - 2022 Average Class Size by School](https://data.cityofnewyork.us/Education/2021-2022-Average-Class-Size-by-School/sgr7-hhwp) dataset, along with its attachments. (Here, only `2021-2022 Average Class Size By School DD.xlsx` is the data dictionary; the other attachment is the dataset as an Excel spreadsheet.)


In [53]:
class_size = pd.read_csv('2021_-_2022_Average_Class_Size_by_School.csv')

In [54]:
class_size.head()

Unnamed: 0,DBN,School Name,Grade Level,Program Type,Number of Students,Number of Classes,Average Class Size,Minimum Class Size,Maximum Class Size
0,01M015,PS 015 ROBERTO CLEMENTE,K,G&T,13,1,13.0,<15,<15
1,01M015,PS 015 ROBERTO CLEMENTE,K,ICT,17,1,17.0,17,17
2,01M015,PS 015 ROBERTO CLEMENTE,1,G&T,8,1,8.0,<15,<15
3,01M015,PS 015 ROBERTO CLEMENTE,1,ICT,18,1,18.0,18,18
4,01M015,PS 015 ROBERTO CLEMENTE,2,G&T,8,1,8.0,<15,<15


Here, most of the columns make sense to me. From the data dictionary, I can see that Program Type is coded as follows:

- General Education (Gen Ed)
- Integrated Co-Teaching (ICT)
- Gifted and Talented (G&T)
- Self-Contained (SC)
- Accelerated (Acc)


What does not make sense is the `Minimum Class Size` column, which in some rows is identical to the `Maximum Class Size` column. Therefore, I'll likely not use this column.

## 1.2 Assessing Data Types
One of the next things we'll check is the data type for each column to make sure that they are in the right format. 

In [55]:
class_size.dtypes

DBN                    object
School Name            object
Grade Level            object
Program Type           object
Number of Students      int64
Number of Classes       int64
Average Class Size    float64
Minimum Class Size     object
Maximum Class Size     object
dtype: object

I would not necessarily change the data types for all columns (especially when there are a lot), **just the ones that you might potentially need**. 

Here, `Maximum Class Size` is an `object` format (I'm going to ignore `Minimum Class Size` for now), likely because the size is sometimes input as `<INT` and sometimes `INT`. 
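If we did need `Maximum Class Size` as a number, one possible approach is sketched below. It runs on a small made-up Series rather than the real column, and the variable names are only for illustration. Keep in mind that a value like `<15` is an upper bound, so the cleaned number is not an exact class size.

```python
import pandas as pd

# Hypothetical sample that mimics the mixed '<15' / '17' entries in the column
max_size = pd.Series(['<15', '17', '<15', '18'], name='Maximum Class Size')

# Remember which rows were reported as "less than" rather than as an exact count
is_upper_bound = max_size.str.startswith('<')

# Strip the leading '<' and convert what is left to numbers
max_size_num = pd.to_numeric(max_size.str.lstrip('<'), errors='coerce')

print(max_size_num.tolist())
print(is_upper_bound.tolist())
```

Keeping the `is_upper_bound` flag lets you exclude or treat those rows differently later on.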



## 1.3 Replacing Data

We went over replacing data last week. There are actually a few ways to do this: 
- `df.replace(to_replace=old_value, value=new_value)`


In [56]:
class_size['Grade Level'].unique()

array(['K', '1', '2', '3', '4', '5', '6', '7', '8', 'K-8 SC'],
      dtype=object)

In [57]:
## Warning: assigning the result back overwrites the column in `class_size`!
class_size['Grade Level'] = class_size['Grade Level'].replace('K', '0')


In [58]:
class_size['Grade Level'].unique()

array(['0', '1', '2', '3', '4', '5', '6', '7', '8', 'K-8 SC'],
      dtype=object)

You can also replace multiple values at once

In [59]:
# We should probably not replace 'K-8 SC' with 0, but showing here for demonstration purposes
class_size['Grade Level'] = class_size['Grade Level'].replace(['K','K-8 SC'], '0')


In [60]:
class_size['Grade Level'].unique()

array(['0', '1', '2', '3', '4', '5', '6', '7', '8'], dtype=object)

Note that we are not actually changing the original data file! We are only changing the copy of the data held in the `class_size` variable; the CSV on disk stays the same.
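As a small sketch of both points, `replace` also accepts a dictionary that maps each old value to its own new value, and it returns a new Series unless you assign the result back. The Series below is made up for illustration.

```python
import pandas as pd

# Hypothetical sample of grade labels
grades = pd.Series(['K', '1', '2', 'K-8 SC'], name='Grade Level')

# Each key is an old value and each value is its replacement
recoded = grades.replace({'K': '0', 'K-8 SC': 'SC'})

print(recoded.tolist())  # the recoded copy
print(grades.tolist())   # the original Series is untouched
```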

We can also use 
* `df.loc[df['column_name'] == some_value, 'column_name'] = new_value`


In [61]:
class_size.loc[class_size['Grade Level'] == 'K','Grade Level'] = '0'    

For replacing null values, see below. 

## 1.4 Changing data types
Notice that we changed everything in "Grade Level" to numbers, but it's still showing up as an `object`.  Now let's try to change the data type for `Grade Level`. 

`.astype()` converts a column to the data type you specify.


In [62]:
class_size.dtypes

DBN                    object
School Name            object
Grade Level            object
Program Type           object
Number of Students      int64
Number of Classes       int64
Average Class Size    float64
Minimum Class Size     object
Maximum Class Size     object
dtype: object

In [63]:
## What I've done here is replace the old `Grade Level` column with 
## a version of it that is an int
class_size['Grade Level'] = class_size['Grade Level'].astype(np.int64)

In [64]:
# The column first showed up as int32 here, so np.int64 was used above
# to force 64-bit integers explicitly.
class_size['Grade Level'].dtype

dtype('int64')
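The conversion above only works because every remaining value looks like a whole number. The sketch below, on a made-up Series, shows what happens when a label such as `K-8 SC` is still present, and one way around it using `pd.to_numeric` with `errors='coerce'`.

```python
import pandas as pd

# Hypothetical column that still contains a non-numeric label
grades = pd.Series(['0', '1', '2', 'K-8 SC'])

# astype() raises an error as soon as it meets a value it cannot convert
try:
    grades.astype('int64')
    converted_ok = True
except ValueError:
    converted_ok = False

# errors='coerce' turns unparseable values into NaN instead of raising
as_numbers = pd.to_numeric(grades, errors='coerce')

# 'Int64' (capital I) is pandas' nullable integer type, which can hold missing values
as_nullable_int = as_numbers.astype('Int64')

print(converted_ok)
print(as_nullable_int)
```

Whether a coerced missing value is acceptable depends on the analysis; here it simply makes the problem rows easy to find with `isna()`.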

## 1.5 Null values in pandas

There are two main ways to represent the absence of a value in a cell in pandas: 
- `None` is Python's own "nothing here" object. It marks a missing entry, but it is not a numeric type. 
- `NaN` ("not a number") is what pandas uses to represent missing data in numeric columns.
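A quick sketch of how the two behave, using made-up values: pandas converts `None` to `NaN` in a numeric column, and `isna()` detects both kinds of missing value.

```python
import numpy as np
import pandas as pd

# None in a numeric Series is stored as NaN, so the column stays float64
numbers = pd.Series([1.0, None, 3.0])

# None in a text Series is still recognised as missing
words = pd.Series(['a', None, 'c'])

print(numbers.dtype)
print(numbers.isna().tolist())
print(words.isna().tolist())

# NaN is never equal to itself, so use isna() rather than == to find it
print(np.nan == np.nan)
```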


In [65]:
projects_under_const.head()

Unnamed: 0,School Name,BoroughCode,Geographical District,Project Description,Construction Award,Project type,Building ID,Building Address,City,Postcode,...,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,Location 1,Data As Of,data_year
0,,M,2,,0.0,CAP,M777,227 WEST 27TH STREET,Manhattan,,...,,,,,,,,,,
1,BAYSIDE HIGH SCHOOL - QUEENS,Q,26,FY19 RESO A AUDITORIUM UPGRADE,1261000.0,CIP,Q405,32-24 CORPORAL KENNEDY STREET,Queens,10301.0,...,,,,,,,,,1/6/22,2022.0
2,P.S. @ PARCEL F - QUEENS,Q,30,Demo,0.0,CAP,,2ND STREET BETWEEN 56TH AND 57TH AVENUE,Queens,11101.0,...,,,,,,,,,10/30/18,2018.0
3,3K CENTER @ 3893 DYRE AVENUE - BRONX,X,11,Lease,6262000.0,CAP,X501,3893 DYRE AVEUNE,Bronx,,...,,,,,,,,,8/4/22,2022.0
4,P.S. 129 - QUEENS,Q,25,Addition,0.0,CAP,Q129,128-02 7TH AVENUE,Queens,11356.0,...,-73.839771,7.0,19.0,945.0,4096774.0,4039760000.0,Whitestone,"(40.790638, -73.839771)",2/6/19,2019.0


## 1.6 Handling missing data
Now, let's say that our analysis depends on knowing the year the data was created. There are a few ways of handling missing data. 

### 1.6.1 Removing rows 
We can remove those rows with data missing from a column that we are planning to use in our analysis. 

In [66]:
projects_under_const[projects_under_const['data_year'].isna()]

Unnamed: 0,School Name,BoroughCode,Geographical District,Project Description,Construction Award,Project type,Building ID,Building Address,City,Postcode,...,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,Location 1,Data As Of,data_year
0,,M,2,,0.0,CAP,M777,227 WEST 27TH STREET,Manhattan,,...,,,,,,,,,,


In [67]:
# isna() returns a boolean (True or False) for each row: True where data_year is NaN.
# notna() is its opposite: True where data_year has a value.
# We use that boolean to filter the dataframe and keep only
# the rows where the data_year column is not a NaN.

projects_under_const_new = projects_under_const[projects_under_const['data_year'].notna()]
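pandas also has a dedicated method for this, `dropna()`. The sketch below uses a made-up three-row table to show that `dropna(subset=[...])` gives the same rows as the boolean filter, while a bare `dropna()` drops any row with a missing value in any column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the projects table
toy = pd.DataFrame({
    'School Name': ['A', 'B', 'C'],
    'data_year': [2022.0, np.nan, 2019.0],
    'Postcode': [np.nan, 11101.0, 11356.0],
})

# Boolean filter: keep rows where data_year has a value
by_mask = toy[toy['data_year'].notna()]

# Same result with dropna, looking only at the data_year column
by_dropna = toy.dropna(subset=['data_year'])

# Without subset, a missing value in ANY column removes the row
strict = toy.dropna()

print(len(by_dropna), len(strict))
```

With as many sparse columns as this dataset has, limiting the check with `subset` avoids throwing away most of the rows.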

### 1.6.2 Replacing missing data
We can also replace the missing data with certain values: 
- We can replace the data with the mean of the non-NaN column values, for numerical values. (For instance, if our column were something like "adult heights", then replacing the NaN with the column mean would leave the sample mean unchanged, which might be good for regression purposes.) 
- We can also replace with the median (if you think there are outliers in the sample that might be skewing the mean).
- Replacing with the mode (most frequent value) would make more sense if we think that there's some default value.
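To make the three options concrete, here is a minimal sketch on a made-up Series of years that includes one outlier (1900) and one missing value. The numbers are illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical years: one outlier (1900) and one missing value
years = pd.Series([1900.0, 2019.0, 2021.0, 2022.0, 2022.0, np.nan])

# mean(), median() and mode() all skip NaN by default
mean_fill = years.fillna(years.mean())
median_fill = years.fillna(years.median())
mode_fill = years.fillna(years.mode().iloc[0])  # mode() returns a Series; take its first entry

# The last row was the missing one, so compare what each strategy put there
print(mean_fill.iloc[-1], median_fill.iloc[-1], mode_fill.iloc[-1])
```

The outlier drags the mean far below the typical year, while the median and mode stay close to where most of the values are.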

**What would you do here?**

In [68]:
#Looking at the dataset, it might be better to set the year as the median/mean, because the mode is
#some distance away from mean (2022 v. 2015). But since there is only one row missing the 'data_year' value,
#this decision probably doesn't matter too much.
print(projects_under_const['data_year'].describe())

projects_under_const[projects_under_const['data_year'].isna()]

count    9000.000000
mean     2015.535333
std        23.465827
min      1900.000000
25%      2019.000000
50%      2021.000000
75%      2022.000000
max      2022.000000
Name: data_year, dtype: float64


Unnamed: 0,School Name,BoroughCode,Geographical District,Project Description,Construction Award,Project type,Building ID,Building Address,City,Postcode,...,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,Location 1,Data As Of,data_year
0,,M,2,,0.0,CAP,M777,227 WEST 27TH STREET,Manhattan,,...,,,,,,,,,,


In [69]:
# This gets the mode of the data_year column
mode_year = projects_under_const['data_year'].mode()

In [70]:
mode_year

0    2022.0
Name: data_year, dtype: float64

In [71]:
# This fills the NaNs with the mode using the fillna() function
# fillna() is a method that fills in missing values with a value of your choice
# mode() returns a Series, so we take its first entry to get a single number
projects_under_const['data_year'].fillna(mode_year.iloc[0])


0       2022.0
1       2022.0
2       2018.0
3       2022.0
4       2019.0
         ...  
8996    2022.0
8997    2022.0
8998    2022.0
8999    2022.0
9000    2022.0
Name: data_year, Length: 9001, dtype: float64

In [48]:
# Now write over the old data_year column with the new one
# (mode_year is a Series, so .iloc[0] pulls out the single mode value)
projects_under_const['data_year'] = projects_under_const['data_year'].fillna(mode_year.iloc[0])
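One caveat worth knowing: `mode()` returns a Series, and when `fillna()` is given a Series it matches values by index label instead of using one value everywhere. In the dataset above the only missing year sits at index 0, which is also the label of the mode, so the two line up. The sketch below, on made-up data, shows what happens when they do not.

```python
import numpy as np
import pandas as pd

# Hypothetical years with missing values at index 1 and 4
years = pd.Series([2022.0, np.nan, 2022.0, 2018.0, np.nan])

mode_series = years.mode()          # a Series with a single entry at index 0
mode_value = mode_series.iloc[0]    # the plain number 2022.0

# Passing the Series: only index 0 could be filled, so both NaNs remain
filled_by_series = years.fillna(mode_series)

# Passing the scalar: every NaN is filled
filled_by_scalar = years.fillna(mode_value)

print(filled_by_series.isna().sum(), filled_by_scalar.isna().sum())
```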

# In-Class Exercise 
Using the `toy_transit_dataset.csv` dataset in this repo, identify and address the missing data issues. 

In [73]:
## insert your code here
transit_data = pd.read_csv("toy_transit_dataset.csv")

In [90]:
# data types of each column
print(transit_data.dtypes)

# examining the first few rows
print(transit_data.head())

# examining the last few rows
print(transit_data.tail())

# I also realized that this dataset is very small (12 rows x 5 columns)

Bus Route ID       object
Departure Time     object
Arrival Station    object
Passenger Count    object
Driver Name        object
dtype: object
  Bus Route ID Departure Time  Arrival Station Passenger Count Driver Name
0          101           8:00     Central Stn.              45    John Doe
1          102           9:30    East Terminal         MISSING  Jane Smith
2          103              -         West Hub              30     MISSING
3         101A           8:15  Central Station              50    John Doe
4          104          10:00      North Plaza              42    J. Brown
   Bus Route ID Departure Time Arrival Station Passenger Count Driver Name
7           101           8:05     Central St.              47    John Doe
8           NaN            NaN             NaN             NaN         NaN
9           NaN            NaN             NaN             NaN         NaN
10          NaN            NaN             NaN             NaN         NaN
11          NaN            NaN             NaN             NaN         NaN

In [87]:
transit_data.size

60

In [92]:
# there are NaN values in all columns, and the data types are not correct

# first, examine the Bus Route ID column for null values
transit_data[transit_data['Bus Route ID'].isna()]

Unnamed: 0,Bus Route ID,Departure Time,Arrival Station,Passenger Count,Driver Name
8,,,,,
9,,,,,
10,,,,,
11,,,,,


In [95]:
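# these seem like noise (no data in any of the columns for these rows), so I remove them
# .copy() makes an independent dataframe, so later edits don't trigger SettingWithCopyWarning
transit_data_clean = transit_data[~transit_data['Bus Route ID'].isna()].copy()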
transit_data_clean

Unnamed: 0,Bus Route ID,Departure Time,Arrival Station,Passenger Count,Driver Name
0,101,8:00,Central Stn.,45,John Doe
1,102,9:30,East Terminal,MISSING,Jane Smith
2,103,-,West Hub,30,MISSING
3,101A,8:15,Central Station,50,John Doe
4,104,10:00,North Plaza,42,J. Brown
5,102,MISSING,East Term.,55,Jane Smith
6,103,7:50,West Hub,35,Alex Johnson
7,101,8:05,Central St.,47,John Doe
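The table still contains placeholder strings such as `MISSING` and `-` that pandas does not recognise as missing. One possible way to handle them, sketched below, is to tell `read_csv` about them up front through its `na_values` parameter. The CSV text here is a made-up stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical CSV content that imitates the placeholders in the transit file
csv_text = """Bus Route ID,Departure Time,Passenger Count
101,8:00,45
102,9:30,MISSING
103,-,30
"""

# na_values lists extra strings that should be read as missing (NaN)
toy = pd.read_csv(io.StringIO(csv_text), na_values=['MISSING', '-'])

print(toy)
print(toy.isna().sum())
```

Because the placeholder is gone before the column is typed, `Passenger Count` is read as a numeric column straight away.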


In [120]:
# Now I clean 'Departure Time'. Parsing it to datetime attaches a date (which I don't want),
# so I keep only the HH:MM part. strftime() returns strings, so the column stays text;
# I think that's ok.
# errors='coerce' turns unparseable entries ('-', 'MISSING') into missing values.
# I leave them missing, because it's probably not accurate to set the time to something else.
transit_data_clean['Departure Time'] = pd.to_datetime(transit_data_clean['Departure Time'], errors='coerce').dt.strftime('%H:%M')

transit_data_clean


Unnamed: 0,Bus Route ID,Departure Time,Arrival Station,Passenger Count,Driver Name
0,101,08:00,Central Stn.,45,John Doe
1,102,09:30,East Terminal,MISSING,Jane Smith
2,103,,West Hub,30,MISSING
3,101A,08:15,Central Station,50,John Doe
4,104,10:00,North Plaza,42,J. Brown
5,102,,East Term.,55,Jane Smith
6,103,07:50,West Hub,35,Alex Johnson
7,101,08:05,Central St.,47,John Doe


In [150]:
# Convert 'Passenger Count' to numbers; errors='coerce' turns 'MISSING' into a missing value.
# 'Int64' is pandas' nullable integer type, so the column can hold integers and missing values.
transit_data_clean['Passenger Count'] = pd.to_numeric(transit_data_clean['Passenger Count'], errors='coerce').astype('Int64')

# Replace the missing value with the mean (mean() skips missing values, which is good).
# We round because passenger counts are whole numbers.
mean_passengers = round(transit_data_clean['Passenger Count'].mean())
transit_data_clean['Passenger Count'] = transit_data_clean['Passenger Count'].fillna(mean_passengers)

transit_data_clean

Unnamed: 0,Bus Route ID,Departure Time,Arrival Station,Passenger Count,Driver Name
0,101,08:00,Central Stn.,45,John Doe
1,102,09:30,East Terminal,43,Jane Smith
2,103,,West Hub,30,MISSING
3,101A,08:15,Central Station,50,John Doe
4,104,10:00,North Plaza,42,J. Brown
5,102,,East Term.,55,Jane Smith
6,103,07:50,West Hub,35,Alex Johnson
7,101,08:05,Central St.,47,John Doe


In [170]:
# Lastly, 'Driver Name'.
# This data is probably not too important, so we set the missing entry to the mode.
# mode() returns a Series, so [0] picks the most frequent name.
replacement_name = transit_data_clean['Driver Name'].mode()[0]

transit_data_clean['Driver Name'] = transit_data_clean['Driver Name'].replace("MISSING", replacement_name)


In [172]:
transit_data_clean

Unnamed: 0,Bus Route ID,Departure Time,Arrival Station,Passenger Count,Driver Name
0,101,08:00,Central Stn.,45,John Doe
1,102,09:30,East Terminal,43,Jane Smith
2,103,,West Hub,30,John Doe
3,101A,08:15,Central Station,50,John Doe
4,104,10:00,North Plaza,42,J. Brown
5,102,,East Term.,55,Jane Smith
6,103,07:50,West Hub,35,Alex Johnson
7,101,08:05,Central St.,47,John Doe


# Miniconda (optional)
Some of you may have noticed that Anaconda takes up 3GB. If this is an issue on your computer, and if you have time right now: 

1) Follow [these instructions](https://docs.conda.io/projects/miniconda/en/latest/miniconda-install.html) to download Miniconda, which is a more lightweight Python environment. I think it's about 400 MB. 
2) Once you have installed Miniconda, type `conda list` in your terminal. If you get a list of installed packages, you've got conda installed. 
3) Now make sure the `gds_py_smaller.yml` file is in your current working directory, and type

 `conda env create -f gds_py_smaller.yml`

In [None]:
#I installed miniforge (which is for Mamba)