# Stage 2: Preparing Visualization Dataset

This notebook contains the instructions to prepare a dataset for a bubble map data visualization (made in Power BI) which uses color encoding to identify which high schools teach Computer Science and which don't. 

This dataset will consist of three key types of columns: 

1. **School Course Statistics** such as whether the school teaches computer science, how many computer science courses it teaches, and how many computer science enrollments were recorded for the year. 

2. **School Information** such as the School Name, Email Address, Principal Name, Phone Number

3. **School GeoData** such as the latitude and longitude

The datasets imported in this notebook will be from the `data/labelled_data` path. These datasets were prepared by us in the previous Stage 1: Labelling Courses notebook. 

These instructions will be of help when you want to prepare data for a visualization in future years. The original file format provided by OSPI may slightly change but this is a good reference point to see our method. You may need to tweak the names of the files being imported and the column names if they have changed with time.

*Note: For new pandas users, a dataframe is a table of data made up of columns and rows. We will use this term frequently when discussing our process.*

## Part 1: Setup

In this step we will import each of the necessary packages for our data preparation and data wrangling.  

We will be using the following packages: 

1. **CSV:** to read in csv files
2. **Pandas:** for data wrangling - this involves reshaping, merging, concatenating (stacking two dataframes), adding and removing columns, renaming columns, grouping and summarizing. 
3. **Numpy:** for mathematical operations - this involves setting the data types of columns and setting default values of columns
4. **Altair:** for data visualizations. We will be using this package to quickly create a data visualization that tests our visualization dataset. 

In [40]:
#Setup (Importing Packages) 
import csv #to read in csv files
import pandas as pd #for data wrangling
import numpy as np  #for mathematical operations
import altair as alt #for data visualization
import warnings

## Part 2: Preparing the CIP Course Dataset

In this part we will be preparing the CIP Course Dataset to showcase the total students in each CIP course. 

**1. In this step we are importing `cip_course_statistics_2017.csv` file using the pandas read_csv method and saving it as a dataframe.** 

This dataframe contains a list of the CIP courses taught at middle schools and high schools in the state of Washington. Please note that this list is partially complete as data was not available for all Washington schools in the year 2017.

*Note: You will need to edit the file name to match the year of the data you are processing. For example, you would change the file name from 'cip_course_statistics_2017.csv' to 'cip_course_statistics_2018.csv'.*

In [2]:
#Importing and Saving Student Results for CIP Courses
#Note: this is where you will want to change the file name for the new CIP Student Results Dataset
cip_courses  = pd.read_csv("data/labelled_data/CIP_Data/cip_course_statistics_2017.csv")
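
Since only the year in the file name changes from one release to the next, the path can be built from a single variable. The sketch below is one possible way to do this. The `YEAR` variable and the `labelled_path` helper are hypothetical names that do not appear elsewhere in this notebook, and the sketch assumes future files keep the same folder layout and naming pattern as the 2017 files.

```python
from pathlib import Path

# Hypothetical setting: the year of the files being processed
YEAR = 2018

def labelled_path(subfolder, stem, year):
    """Build a path such as data/labelled_data/CIP_Data/cip_course_statistics_2018.csv."""
    return Path("data/labelled_data") / subfolder / f"{stem}_{year}.csv"

cip_path = labelled_path("CIP_Data", "cip_course_statistics", YEAR)
scc_path = labelled_path("State_Course_Code_Data", "state_course_code_statistics", YEAR)

print(cip_path.as_posix())
print(scc_path.as_posix())
```

Either path can then be passed directly to `pd.read_csv`, so the year only needs to be changed in one place.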

**2. The head of the dataframe (the first 5 rows) has been printed for you to get a better understanding of the data.**

*Note: If you want to see all the data, type `cip_courses` in a new cell and press Shift and Enter at the same time.*

**Here is a quick explanation on what each column is.**

| **COLUMN NAME** | **COLUMN DESCRIPTION** |
| ----------- | ----------- |
| **DistrictCode** | Code of the School District in which the school is (e.g. 2420) |
| **DistrictName** | Name of the School District in which the school is (e.g. Asotin-Anatone School District) |
| **SchoolCode** | Code of the School (e.g. 2434) |
| **SchoolName** | Name of the School (e.g. Asotin Jr Sr High) |
| **term** | Which semester the course was in (e.g. SEM1) |
| **cipcode** | National Course Code under which this class falls |
| **courseTitle** | Title of the course (e.g. AP Computer Science Principles) |
| **letterGrade** | Letter grade (e.g. A, A-, B, etc.) |
| **count** | Total students who received that letterGrade in the course |
| **cs_course** | Whether the course is a computer science course or not |

In [3]:
# Printing the head of the dataframe
cip_courses.head(5)

Unnamed: 0,DistrictCode,DistrictName,SchoolCode,SchoolName,term,cipcode,courseTitle,letterGrade,count,cs_course
0,2420,Asotin-Anatone School District,2434,Asotin Jr Sr High,SEM1,110201,AP Computer Science Principles,A,3,yes
1,2420,Asotin-Anatone School District,2434,Asotin Jr Sr High,SEM1,110201,CSS ENGINEERING,A-,1,yes
2,2420,Asotin-Anatone School District,2434,Asotin Jr Sr High,SEM1,110201,CSS ENGINEERING,A,3,yes
3,2420,Asotin-Anatone School District,2434,Asotin Jr Sr High,SEM1,110201,CSS ENGINEERING,D,1,yes
4,2420,Asotin-Anatone School District,2434,Asotin Jr Sr High,SEM2,110201,AP Computer Science Principles,A,3,yes


**3. In the `cip_courses` dataframe, each letterGrade statistic of a course is its own row. This increases the total number of rows and makes it difficult to see the statistics for a course in one row. Therefore we are going to spread the data so that each letter grade becomes its own column. This will also help us calculate the total students easily. I have printed the head of the `cip_courses` dataframe for you to easily see the changes made.**

In [4]:
#Reshaping (Spreading) - lettergrades are becoming columns
cip_courses = pd.pivot_table(cip_courses, index = ['DistrictCode','DistrictName','SchoolCode','SchoolName','term','cipcode','courseTitle', 'cs_course'], columns = 'letterGrade', values = 'count')

# We do the NA replacement to 0 step below because each course 
# did not always have a count of 1 or more for each letterGrade. 
# Because of this some letterGrade columns for a course filled the cell value as NA instead of 0. 

#Fill NA for Letter Grades to 0
cip_courses = cip_courses.fillna(0)

#Strip the extra space at the start and end of column names
cip_courses.columns = cip_courses.columns.str.strip()

#List of Column Names
cols = ['A', 'A-', 'B', 'B+', 'B-', 'C','C+', 'C-', 'CR', 'D', 'D+', 'E', 'F', 'N', 'NC', 'P', 'S', 'U', 'W']

#Convert the Columns listed in the `cols` list to Integer DataType
#(astype replaces applymap, which is deprecated in recent versions of pandas)
cip_courses[cols] = cip_courses[cols].astype(np.int64)

cip_courses.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,letterGrade,A,A-,B,B+,B-,C,C+,C-,CR,D,D+,E,F,N,NC,P,S,U,W
DistrictCode,DistrictName,SchoolCode,SchoolName,term,cipcode,courseTitle,cs_course,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1
1147,Othello School District,3015,Othello High School,SEM2,110801,DIGITAL DESIGN,no,27,5,3,3,3,3,2,3,0,8,0,0,3,0,0,0,0,0,0
1158,Lind School District,2903,Lind-Ritzville High School,SEM1,110103,TECHNOLOGY 1A,no,4,2,1,0,0,0,1,2,0,2,0,0,0,0,0,0,0,0,0
1158,Lind School District,2903,Lind-Ritzville High School,SEM1,110801,PHOTOGRAPHY,no,2,1,0,0,0,0,1,1,0,0,0,0,2,0,0,0,0,0,0
1158,Lind School District,2903,Lind-Ritzville High School,SEM2,110103,TECHNOLOGY 1B,no,3,0,0,0,0,0,1,1,0,2,0,0,3,0,0,0,0,0,0
1158,Lind School District,2903,Lind-Ritzville High School,SEM2,110801,PHOTOGRAPHY,no,1,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
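
To see what the spreading step does on a small scale, here is a sketch with made-up rows (the values are not from the OSPI data). It makes one assumption that differs from the cell above: `pd.pivot_table` averages by default when the same course and letter grade appear in more than one row, so this sketch passes `aggfunc="sum"` to add the counts instead. The two choices give the same result whenever each combination appears only once.

```python
import pandas as pd

# Made-up long-format data: one row per course and letter grade.
# The first and third rows repeat the same course and grade on purpose.
long_df = pd.DataFrame({
    "SchoolCode": [1, 1, 1, 2],
    "courseTitle": ["INTRO CS", "INTRO CS", "INTRO CS", "PHOTOGRAPHY"],
    "letterGrade": ["A", "B", "A", "A"],
    "count": [3, 2, 1, 4],
})

# Spread the letter grades into columns, adding up repeated rows
wide_df = pd.pivot_table(
    long_df,
    index=["SchoolCode", "courseTitle"],
    columns="letterGrade",
    values="count",
    aggfunc="sum",   # assumption: repeated rows should be added, not averaged
    fill_value=0,    # a grade nobody received becomes 0 instead of NA
)

# Total students per course is the sum across the grade columns
wide_df["total_students"] = wide_df.sum(axis=1)
wide_df = wide_df.reset_index()

print(wide_df)
```

If you suspect repeated rows in a real file, `long_df.duplicated(["SchoolCode", "courseTitle", "letterGrade"]).any()` is a quick way to check before deciding which aggregation is appropriate.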


**4. We will now add a column called `total_students`, which sums the counts across all letter grades for each course. I have printed the head of the dataframe for you to see the new `total_students` column.**

In [5]:
#Adding Column Stating Total Students in Course
# Summing across the letter grade columns (the `cols` list defined in the previous cell)
cip_courses['total_students'] = cip_courses[cols].sum(axis=1)

#Resetting Index of Dataframe.
cip_courses = cip_courses.reset_index()

#Showing the head of the dataframe
cip_courses.head()

letterGrade,DistrictCode,DistrictName,SchoolCode,SchoolName,term,cipcode,courseTitle,cs_course,A,A-,...,D+,E,F,N,NC,P,S,U,W,total_students
0,1147,Othello School District,3015,Othello High School,SEM2,110801,DIGITAL DESIGN,no,27,5,...,0,0,3,0,0,0,0,0,0,60
1,1158,Lind School District,2903,Lind-Ritzville High School,SEM1,110103,TECHNOLOGY 1A,no,4,2,...,0,0,0,0,0,0,0,0,0,12
2,1158,Lind School District,2903,Lind-Ritzville High School,SEM1,110801,PHOTOGRAPHY,no,2,1,...,0,0,2,0,0,0,0,0,0,7
3,1158,Lind School District,2903,Lind-Ritzville High School,SEM2,110103,TECHNOLOGY 1B,no,3,0,...,0,0,3,0,0,0,0,0,0,10
4,1158,Lind School District,2903,Lind-Ritzville High School,SEM2,110801,PHOTOGRAPHY,no,1,1,...,0,0,0,0,0,0,0,0,0,5


## Part 3: Preparing the State Course Code DataSet
In this part, we are preparing the State Course Code Dataset to showcase the total students in each state course code course.

**1. In this step we are importing the `state_course_code_statistics_2017.csv` using the read_csv method of the Pandas package and saving it as a dataframe.**

This dataframe is a list of state course code courses taught at schools in the state of Washington. Please note that this list is partially complete as data is not available for all Washington state schools as of 2017. 

*Note: You will need to edit the file name to match the year of the data you are processing. For example, you would change the file name from 'state_course_code_statistics_2017.csv' to 'state_course_code_statistics_2018.csv'.*

In [6]:
#Importing and Saving Student Results for State Courses
#Note: this is where you will want to change the file name for the new SCC Student Results Dataset
scc_courses = pd.read_csv("data/labelled_data/State_Course_Code_Data/state_course_code_statistics_2017.csv")

**2. The head of the dataframe (the first 5 rows) has been printed for you to get a better understanding of the data.**

*Note: If you want to see all the data, type `scc_courses` in a new cell and press Shift and Enter at the same time.*

**Here is a quick explanation on what each column is.**

| **COLUMN NAME** | **COLUMN DESCRIPTION** |
| ----------- | ----------- |
| **DistrictCode** | Code of the School District in which the school is (e.g. 2420) |
| **DistrictName** | Name of the School District in which the school is (e.g. Asotin-Anatone School District) |
| **SchoolCode** | Code of the School (e.g. 2434) |
| **SchoolName** | Name of the School (e.g. Asotin Jr Sr High) |
| **term** | Which semester the course was in (e.g. SEM1) |
| **stateCourseCodeId** | State Course Code under which this class falls |
| **courseTitle** | Title of the course (e.g. AP Computer Science Principles) |
| **letterGrade** | Letter grade (e.g. A, A-, B, etc.) |
| **count** | Total students who received that letterGrade in the course |
| **cs_course** | Whether the course is a computer science course or not |

In [7]:
#printing the head of scc_courses
scc_courses.head()

Unnamed: 0,DistrictCode,DistrictName,SchoolCode,SchoolName,term,stateCourseCodeId,courseTitle,letterGrade,count,cs_course
0,17407,Riverview School District,3524,Cedarcrest High School,SEM1,837,PRG/GAMES/SIM A,A-,2,no
1,17407,Riverview School District,3524,Cedarcrest High School,SEM1,837,PRG/GAMES/SIM A,A,6,no
2,17407,Riverview School District,3524,Cedarcrest High School,SEM1,837,PRG/GAMES/SIM A,B,1,no
3,17407,Riverview School District,3524,Cedarcrest High School,SEM1,837,PRG/GAMES/SIM A,B+,1,no
4,17407,Riverview School District,3524,Cedarcrest High School,SEM1,837,PRG/GAMES/SIM A,C,1,no


**3. In the `scc_courses` dataframe, each letterGrade statistic of a course is its own row. This increases the total number of rows and makes it difficult to see the statistics for a course in one row. Therefore we are going to spread the data so that each letter grade becomes its own column. This will also help us calculate the total students easily. I have printed the head of the `scc_courses` dataframe for you to easily see the changes made.**

In [8]:
#Reshaping (Spreading) - lettergrades are becoming columns
scc_courses = pd.pivot_table(scc_courses, index = ['DistrictCode','DistrictName','SchoolCode','SchoolName','term','stateCourseCodeId','courseTitle', 'cs_course'], columns = 'letterGrade', values = 'count')

# We do the NA replacement to 0 step below because each course 
# did not always have a count of 1 or more for each letterGrade. 
# Because of this some letterGrade columns for a course filled the cell value as NA instead of 0. 

#Fill NA for Letter Grades to 0 
scc_courses = scc_courses.fillna(0)

#Strip the extra space at the start and end of column names
scc_courses.columns = scc_courses.columns.str.strip()

#List of Column Names
cols = ['A', 'A-', 'B', 'B+', 'B-', 'C','C+', 'C-', 'CR', 'D', 'D+', 'E', 'F', 'N', 'NC', 'P', 'S', 'U', 'W']

#Convert Columns in List to Integer DataType
#(astype replaces applymap, which is deprecated in recent versions of pandas)
scc_courses[cols] = scc_courses[cols].astype(np.int64)

scc_courses.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,letterGrade,A,A-,B,B+,B-,C,C+,C-,CR,D,D+,E,F,N,NC,P,S,U,W
DistrictCode,DistrictName,SchoolCode,SchoolName,term,stateCourseCodeId,courseTitle,cs_course,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1
1109,Washtucna School District,3075,Washtucna Elementary/High School,SEM1,2309,10 ENGLISH,no,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0
1109,Washtucna School District,3075,Washtucna Elementary/High School,SEM2,2309,10 ENGLISH,no,3,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1147,Othello School District,3015,Othello High School,SEM1,2696,DIGITOOLS,no,4,1,8,0,0,3,1,1,0,5,2,0,3,0,0,0,0,0,0
1147,Othello School District,3015,Othello High School,SEM1,2696,DIGITOOLS C/D,no,11,1,5,1,0,0,1,1,0,1,1,0,1,0,0,0,0,0,2
1147,Othello School District,3015,Othello High School,SEM2,2696,DIGITOOLS,no,11,4,3,5,6,7,1,3,0,5,2,0,4,0,0,0,0,0,2


**4. We will now add a column called `total_students`, which sums the counts across all letter grades for each course. I have printed the head of the dataframe for you to see the new `total_students` column.**

In [9]:
#Adding Column Stating Total Students in Course
# Summing across the letter grade columns (the `cols` list defined in the previous cell)
scc_courses['total_students'] = scc_courses[cols].sum(axis=1)

#Resetting Index of Dataframe
scc_courses = scc_courses.reset_index()

#Showing the head of the dataframe
scc_courses.head()

letterGrade,DistrictCode,DistrictName,SchoolCode,SchoolName,term,stateCourseCodeId,courseTitle,cs_course,A,A-,...,D+,E,F,N,NC,P,S,U,W,total_students
0,1109,Washtucna School District,3075,Washtucna Elementary/High School,SEM1,2309,10 ENGLISH,no,1,0,...,0,0,0,0,0,0,0,0,0,3
1,1109,Washtucna School District,3075,Washtucna Elementary/High School,SEM2,2309,10 ENGLISH,no,3,0,...,0,0,0,0,0,0,0,0,0,5
2,1147,Othello School District,3015,Othello High School,SEM1,2696,DIGITOOLS,no,4,1,...,2,0,3,0,0,0,0,0,0,28
3,1147,Othello School District,3015,Othello High School,SEM1,2696,DIGITOOLS C/D,no,11,1,...,1,0,1,0,0,0,0,0,2,25
4,1147,Othello School District,3015,Othello High School,SEM2,2696,DIGITOOLS,no,11,4,...,2,0,4,0,0,0,0,0,2,53
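
Parts 2 and 3 apply the same steps to two files, so the steps could be wrapped in one function. The sketch below is a hypothetical helper (`spread_letter_grades` and `GRADE_COLS` are not names from this notebook) and it differs from the cells above in two ways that should be treated as assumptions: it adds up repeated rows with `aggfunc='sum'`, and it uses `reindex` so that a letter grade which never occurs in a given year's file becomes a column of zeros rather than causing a `KeyError`.

```python
import pandas as pd

GRADE_COLS = ['A', 'A-', 'B', 'B+', 'B-', 'C', 'C+', 'C-', 'CR', 'D', 'D+',
              'E', 'F', 'N', 'NC', 'P', 'S', 'U', 'W']

def spread_letter_grades(courses, code_col):
    """Return one row per course with a column per letter grade and a total.

    code_col is the course code column: 'cipcode' or 'stateCourseCodeId'.
    """
    index_cols = ['DistrictCode', 'DistrictName', 'SchoolCode', 'SchoolName',
                  'term', code_col, 'courseTitle', 'cs_course']
    # Strip stray spaces from the grades before they become column names
    courses = courses.assign(letterGrade=courses['letterGrade'].str.strip())
    wide = pd.pivot_table(courses, index=index_cols, columns='letterGrade',
                          values='count', aggfunc='sum')
    # Assumption: grades outside GRADE_COLS are dropped, absent grades become 0
    wide = wide.reindex(columns=GRADE_COLS).fillna(0).astype('int64')
    wide['total_students'] = wide[GRADE_COLS].sum(axis=1)
    return wide.reset_index()

# Made-up rows for illustration only
toy = pd.DataFrame({
    'DistrictCode': [1, 1],
    'DistrictName': ['Example District', 'Example District'],
    'SchoolCode': [10, 10],
    'SchoolName': ['Example High', 'Example High'],
    'term': ['SEM1', 'SEM1'],
    'cipcode': [110201, 110201],
    'courseTitle': ['EXAMPLE COURSE', 'EXAMPLE COURSE'],
    'letterGrade': ['A ', 'B+'],
    'count': [3, 2],
    'cs_course': ['yes', 'yes'],
})

result = spread_letter_grades(toy, 'cipcode')
print(result[['courseTitle', 'A', 'B+', 'W', 'total_students']])
```

With a helper like this, the same call would serve both files, for example `spread_letter_grades(scc_courses, 'stateCourseCodeId')` on the raw State Course Code dataframe.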


## About Testing
From this part onwards, a test follows each key step to check whether any rows have been unintentionally dropped during the merging and concatenating actions. There are also tests to check whether any important columns have missing data. Each test prints its result above the head of the dataframe, so please check the result before relying on the output. These may be simple if/else tests, but it is very important to ensure that every test passes. 
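
The two kinds of checks described above can be sketched as small reusable functions. This is only an illustration: `check_row_count` and `check_no_missing` are hypothetical helper names, and the dataframes are made up.

```python
import pandas as pd

def check_row_count(test_name, df, expected_rows):
    """Print and return whether df has exactly expected_rows rows."""
    passed = len(df) == expected_rows
    print(("PASSED TEST: " if passed else "NOT PASSED: ") + test_name)
    return passed

def check_no_missing(test_name, df, columns):
    """Print and return whether the given columns contain no missing values."""
    passed = not df[columns].isna().any().any()
    print(("PASSED TEST: " if passed else "NOT PASSED: ") + test_name)
    return passed

# Made-up dataframes for illustration only
first = pd.DataFrame({"SchoolCode": [1, 2]})
second = pd.DataFrame({"SchoolCode": [3]})
combined = pd.concat([first, second], sort=False)

rows_ok = check_row_count("COMBINED ROWS", combined, len(first) + len(second))
codes_ok = check_no_missing("SCHOOL CODES PRESENT", combined, ["SchoolCode"])
```

Returning the result as well as printing it makes it possible to collect all the results and review them together at the end of the notebook.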

## Part 4: Combining CIP Course Data and State Course Data

In this step, we will be combining the CIP Code Course dataframe (`cip_courses`) and State Course Code Course dataframe (`scc_courses`) into one dataframe. 

In [45]:
#Combining the State Course and CIP Course Data using the concat method of Pandas
all_courses = pd.concat([cip_courses, scc_courses], sort=False)

####################################################################################################
### ALL COURSES TEST:
### This test checks if the total rows of all courses is equal 
### to the sum of rows of the CIP courses and SCC courses dataframes

# Saving Test variable for test at the end
all_courses_test = (len(all_courses)== (len(cip_courses) + len(scc_courses)))

if(all_courses_test):
    print("PASSED TEST: ALL COURSES TEST")

else:
    print("NOT PASSED: ALL COURSES TEST")    

# Printing the head of the dataframe
all_courses.head()

PASSED TEST: ALL COURSES TEST


Unnamed: 0,DistrictCode,DistrictName,SchoolCode,SchoolName,term,cipcode,courseTitle,cs_course,A,A-,...,E,F,N,NC,P,S,U,W,total_students,stateCourseCodeId
0,1147,Othello School District,3015,Othello High School,SEM2,110801.0,DIGITAL DESIGN,no,27,5,...,0,3,0,0,0,0,0,0,60,
1,1158,Lind School District,2903,Lind-Ritzville High School,SEM1,110103.0,TECHNOLOGY 1A,no,4,2,...,0,0,0,0,0,0,0,0,12,
2,1158,Lind School District,2903,Lind-Ritzville High School,SEM1,110801.0,PHOTOGRAPHY,no,2,1,...,0,2,0,0,0,0,0,0,7,
3,1158,Lind School District,2903,Lind-Ritzville High School,SEM2,110103.0,TECHNOLOGY 1B,no,3,0,...,0,3,0,0,0,0,0,0,10,
4,1158,Lind School District,2903,Lind-Ritzville High School,SEM2,110801.0,PHOTOGRAPHY,no,1,1,...,0,0,0,0,0,0,0,0,5,
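
The `.0` on `cipcode` in the output above is a side effect of the concatenation: rows that come from `scc_courses` have no `cipcode`, so pandas fills them with NA and stores the column as decimal numbers. The sketch below reproduces this with made-up rows and shows one optional way to get whole numbers back, the nullable `Int64` dtype, which this notebook does not use.

```python
import pandas as pd

# Made-up rows: each dataframe has a code column the other lacks
cip_toy = pd.DataFrame({"SchoolCode": [3015], "cipcode": [110801]})
scc_toy = pd.DataFrame({"SchoolCode": [3075], "stateCourseCodeId": [2309]})

both = pd.concat([cip_toy, scc_toy], sort=False)
print(both)
print(both["cipcode"].dtype)

# Optional: the nullable integer dtype keeps whole numbers alongside missing values
cipcode_as_int = both["cipcode"].astype("Int64")
print(cipcode_as_int.dtype)
```

Note also that `pd.concat` keeps each dataframe's original row labels, so labels repeat in the combined dataframe unless `ignore_index=True` is passed.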


## Part 5: Making a lite version of all_courses dataframe

In the cell below, we make a lite version of the `all_courses` dataframe by selecting only the columns we need: ``"DistrictCode","SchoolCode", "SchoolName", "courseTitle", "cs_course", "total_students"``. The head of the dataframe has been printed for you to see the lite version of this dataframe.

In [46]:
# Making a small version of the all courses file with only the necessary columns
all_courses_lite = all_courses.copy()

#Selecting only columns of concern
all_courses_lite = all_courses_lite[["DistrictCode","SchoolCode", "SchoolName", "courseTitle", "cs_course", "total_students"]]


# Test 
all_courses_lite_test = (len(all_courses) == len(all_courses_lite))

if(all_courses_lite_test):
    print("PASSED TEST: ALL COURSES LITE")
else: 
    print("NOT PASSED TEST: ALL COURSES LITE")

all_courses_lite.head()

PASSED TEST: ALL COURSES LITE


Unnamed: 0,DistrictCode,SchoolCode,SchoolName,courseTitle,cs_course,total_students
0,1147,3015,Othello High School,DIGITAL DESIGN,no,60
1,1158,2903,Lind-Ritzville High School,TECHNOLOGY 1A,no,12
2,1158,2903,Lind-Ritzville High School,PHOTOGRAPHY,no,7
3,1158,2903,Lind-Ritzville High School,TECHNOLOGY 1B,no,10
4,1158,2903,Lind-Ritzville High School,PHOTOGRAPHY,no,5


## Part 6: Listing Schools on which we have course data

**As discussed earlier, we do not have course information on all schools in the state of Washington. In the cell below, we make a dataframe which contains all the schools we do have course information on. Below we have printed the head of the dataframe for easy viewing.**

In [54]:
#Copy of all_courses_lite is made for the list of known schools
known_schools = all_courses_lite.copy()

# We only select the most important columns needed
known_schools = known_schools[["SchoolName", "DistrictCode", "SchoolCode"]]

# Since our dataframe has multiple copies of schools we will drop the duplicates, such that each row has a unique School Code
# WARNING: We use the school code when dropping duplicates instead of dropping by School Name as some schools have the same name
known_schools = known_schools.drop_duplicates(['SchoolCode'])

# We reset the index for easy reading and to ensure our rows are in sequential order. 
# Note: when we reindex, a new column called index is added with the old indexes, 
# Note(contd) we drop this column as it is not necessary
known_schools = known_schools.reset_index().drop(columns = ['index'])

# Printing the head of known schools
known_schools.head()

Unnamed: 0,SchoolName,DistrictCode,SchoolCode
0,Othello High School,1147,3015
1,Lind-Ritzville High School,1158,2903
2,Ritzville High School,1160,2132
3,Asotin Jr Sr High,2420,2434
4,Mid-Columbia Parent Partnership,3017,1941
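
The warning in the cell above about dropping duplicates by school code rather than by name can be illustrated with made-up rows, where two different schools share one name:

```python
import pandas as pd

# Made-up rows: school 1111 appears twice, and school 2222 shares its name
schools = pd.DataFrame({
    "SchoolName": ["Lincoln High School", "Lincoln High School", "Lincoln High School"],
    "DistrictCode": [100, 100, 200],
    "SchoolCode": [1111, 1111, 2222],
})

by_code = schools.drop_duplicates(["SchoolCode"])
by_name = schools.drop_duplicates(["SchoolName"])

print(len(by_code))  # 2: both schools are kept
print(len(by_name))  # 1: the second school is lost
```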


## Part 7: Listing Schools which teach computer science
**In the cell below, we are creating a dataframe which consists of a list of schools which teach computer science, the total computer science classes they teach, and the total number of students they have enrolled in computer science. We do this by keeping only the rows of the `all_courses` dataframe whose `cs_course` value is "yes".**

In [52]:
# Filtering to keep only Computer Science courses
cs_results = all_courses.loc[all_courses["cs_course"] == "yes"]

# Grouping by high school and summarizing for the count of computer science classes taught
cs_schools = cs_results.groupby(['SchoolCode','SchoolName']).agg({'cs_course': 'count', 'total_students': 'sum'})

# Adding Column to say School Teaches Computer Science
cs_schools["school_teaches_cs"] = "Teaches Computer Science"

# Resetting the Index after grouping by
cs_schools = cs_schools.reset_index()

# Renaming column to state total computer science courses taught in that year
cs_schools = cs_schools.rename(columns = {'cs_course': 'total_cs_courses', 'total_students': 'yearly_enrolled_in_cs'})

# Printing head of schools which teach computer science dataframe
cs_schools.head()

Unnamed: 0,SchoolCode,SchoolName,total_cs_courses,yearly_enrolled_in_cs,school_teaches_cs
0,1519,Edmonds eLearning Academy,1,1,Teaches Computer Science
1,1547,Middle College High School,4,58,Teaches Computer Science
2,1627,Yelm Extension School,2,3,Teaches Computer Science
3,1628,Dishman Hills High School,6,158,Teaches Computer Science
4,1640,Puyallup Online Academy/POA,2,6,Teaches Computer Science
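
The filter and group steps can be tried on made-up rows as follows. This sketch uses pandas named aggregation, which produces the final column names directly; it is an alternative to the `agg` plus `rename` calls in the cell above, not the method used there.

```python
import pandas as pd

# Made-up course rows for illustration only
courses = pd.DataFrame({
    "SchoolCode": [1, 1, 2, 2],
    "SchoolName": ["North High", "North High", "South High", "South High"],
    "cs_course": ["yes", "yes", "yes", "no"],
    "total_students": [20, 15, 8, 30],
})

# Keep only the computer science rows
cs_only = courses.loc[courses["cs_course"] == "yes"]

# One row per school: number of CS course rows and total CS enrollments
summary = (
    cs_only.groupby(["SchoolCode", "SchoolName"])
    .agg(
        total_cs_courses=("cs_course", "count"),
        yearly_enrolled_in_cs=("total_students", "sum"),
    )
    .reset_index()
)

print(summary)
```

Keep in mind that each row of `all_courses` is one course title in one term, so a course offered in both SEM1 and SEM2 contributes two rows to `total_cs_courses`, and a student who takes both semesters is counted twice in `yearly_enrolled_in_cs`.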


## Part 8: Listing Schools which don't teach computer science
**In the cell below, we are creating a dataframe which consists of a list of schools which do not teach computer science, the total computer science classes they teach (in this case 0), and the total number of students they have enrolled in computer science (in this case 0). We do this by retaining the schools which are present in the `known_schools` dataframe but not in the `cs_schools` dataframe. Below we have printed the head of this dataframe for your easy viewing.**

In [59]:
# .copy() makes non_cs_schools an independent dataframe, so adding columns 
# below does not raise a SettingWithCopyWarning
non_cs_schools = known_schools[~known_schools.SchoolCode.isin(cs_schools.SchoolCode)].copy()

#Adding column to say it teaches 0 cs courses
non_cs_schools["total_cs_courses"] = 0

#Adding column to say it has 0 students enrolled in CS
non_cs_schools["yearly_enrolled_in_cs"] = 0

#Adding column to say that School does not teach CS
non_cs_schools["school_teaches_cs"] = "Doesn't Teach Computer Science"

#Resetting Index
non_cs_schools = non_cs_schools.reset_index().drop(columns = ['index', 'DistrictCode'])

#Selecting columns to keep
#non_cs_schools = non_cs_schools[['SchoolName','SchoolCode', 'total_cs_courses', 'yearly_enrolled_in_cs', "school_teaches_cs"]]

#Printing schools which do not teach CS
non_cs_schools.head()


Unnamed: 0,SchoolName,SchoolCode,total_cs_courses,yearly_enrolled_in_cs,school_teaches_cs
0,Othello High School,3015,0,0,Doesn't Teach Computer Science
1,Lind-Ritzville High School,2903,0,0,Doesn't Teach Computer Science
2,Prosser High School,2508,0,0,Doesn't Teach Computer Science
3,Richland High School,3511,0,0,Doesn't Teach Computer Science
4,Entiat Middle and High School,3317,0,0,Doesn't Teach Computer Science
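
The `~` and `isin` combination keeps the rows whose school code is absent from the other dataframe. Here is a small sketch with made-up schools. The `.copy()` call is a common way to get an independent dataframe after filtering, so that adding columns does not trigger a `SettingWithCopyWarning`.

```python
import pandas as pd

# Made-up dataframes for illustration only
known = pd.DataFrame({
    "SchoolName": ["North High", "South High", "East High"],
    "SchoolCode": [1, 2, 3],
})
teaches_cs = pd.DataFrame({"SchoolCode": [1]})

# Keep schools whose code is NOT in teaches_cs, as an independent copy
no_cs = known[~known["SchoolCode"].isin(teaches_cs["SchoolCode"])].copy()
no_cs["total_cs_courses"] = 0
no_cs = no_cs.reset_index(drop=True)

print(no_cs)
```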


## Part 9: Listing Statistics of Known Schools

In this step, we are going to list the statistics we have about the state of computer science education in the schools we have information on. We do this by concatenating (stacking) the `cs_schools` and `non_cs_schools` dataframes made in Parts 7 and 8.

In [62]:
#Adding CS and Non CS Schools Data Frames
# sort=True keeps the columns in alphabetical order, as shown in the output below
known_schools_stats = pd.concat([cs_schools, non_cs_schools], sort=True) 

#Resetting Index
known_schools_stats = known_schools_stats.reset_index().drop(columns = ['index'])

#Printing the head of the dataframe
known_schools_stats.head()


Unnamed: 0,SchoolCode,SchoolName,school_teaches_cs,total_cs_courses,yearly_enrolled_in_cs
0,1519,Edmonds eLearning Academy,Teaches Computer Science,1,1
1,1547,Middle College High School,Teaches Computer Science,4,58
2,1627,Yelm Extension School,Teaches Computer Science,2,3
3,1628,Dishman Hills High School,Teaches Computer Science,6,158
4,1640,Puyallup Online Academy/POA,Teaches Computer Science,2,6


## Part 10: List of All High Schools in Washington State

In this step we are importing the `High_Schools_WA_Information.csv` file using the pandas read_csv method and saving it as a dataframe called `high_schools`. This file was sourced from the [OSPI School Directory website](https://eds.ospi.k12.wa.us/directoryeds.aspx). 

**Note:** This file was modified in the following ways: 

- Only the columns most relevant to our visualization dataset were kept. Those were: 'LEACode', 'LEAName', 'SchoolCode', 'SchoolName', 'LowestGrade','HighestGrade', 'PrincipalName', 'Email','Phone', 'OrgCategoryList','GradeCategory', and 'City'. 

- We only kept the schools which have their highest grade as 9, 10, 11 or 12. Some of these schools are alternate schools, jails, detention centres and learning centres. Do not be surprised by this. We have retained these schools so as to be inclusive. 

- The original file was missing the names of the last three schools listed in this csv file. I added these manually after a Google search. Please be aware that the data from OSPI is not always 100% complete. 

Below I have listed out the description of each column name for easy understanding.

| **Column Name** | **Column Description** | 
| ----------- | ----------- |
| LEACode | Local Education Agency Code e.g. 3346 |
| LEAName | Local Education Agency Name e.g. Colfax School District |
| SchoolCode | School Code of the School in Washington State e.g. 3366 |
| SchoolName | Name of the School e.g. Colfax High School |
| LowestGrade | Lowest Grade in the School e.g. 7 |
| HighestGrade | Highest Grade in the School e.g. 12 |
| PrincipalName | Name of the School Principal e.g. David Gibb |
| Email | Email of the Principal e.g. david.gibb@csd300.com |
| Phone | Phone Number of the Principal e.g. 509.830.2347 |
| OrgCategoryList | Type of Category the School falls under e.g. Public School, Regular School |
| GradeCategory | Type of School e.g. High School, K-12, etc. |
| City | City Name e.g. Colfax |

In [65]:
# importing and saving the High_Schools_WA_Information CSV file
high_schools = pd.read_csv("data/labelled_data/School_Data/High_Schools_WA_Information.csv")
high_schools.head()

Unnamed: 0,LEACode,LEAName,SchoolCode,SchoolName,LowestGrade,HighestGrade,PrincipalName,Email,Phone,OrgCategoryList,GradeCategory,City
0,38300,Colfax School District,3366,Colfax High School,7,12,David Gibb,david.gibb@csd300.com,509.830.2347,"Public School, Regular School",High School,Colfax
1,38301,Palouse School District,2634,Palouse High School,9,12,Mike Jones,mjones@garpal.net,509.878.1921,"Public School, Regular School",High School,Palouse
2,38306,Colton School District,2588,Colton School,PK,12,Tim Casey,tcasey@colton.k12.wa.us,509.229.3386,"Public School, Regular School",PK-12,Colton
3,38320,Rosalia School District,3204,Rosalia Elementary & Secondary School,PK,12,Matthew McLain,mmclain@rosaliaschools.org,509.523.3061,"Public School, Regular School",PK-12,Rosalia
4,38322,St. John School District,3068,St John/Endicott High,9,12,Mark Purvine,mpurvine@stjohn.wednet.edu,509.648.3336,"Public School, Regular School",High School,Saint John


In [72]:
len(high_schools)

660

## Part 11: Listing High Schools we have Statistics on

The list of statistics we have is for all types of schools in the state of Washington. We only want statistics for schools which have a grade of 9 and above. Therefore we are going to make a new dataframe called `known_high_school_stats` which will retain only the statistics of high schools in `known_schools_stats`.

In [79]:
# Making a copy of the known_schools_stats dataframe
known_high_school_stats = known_schools_stats.copy()

# We are only retaining school rows which are high schools
known_high_school_stats = known_high_school_stats[known_high_school_stats.SchoolCode.isin(high_schools.SchoolCode)]

# Resetting Index
known_high_school_stats = known_high_school_stats.reset_index().drop(columns = ['index'])

# Printing the head of the new dataframe
known_high_school_stats.head()

Unnamed: 0,SchoolCode,SchoolName,school_teaches_cs,total_cs_courses,yearly_enrolled_in_cs
0,1519,Edmonds eLearning Academy,Teaches Computer Science,1,1
1,1547,Middle College High School,Teaches Computer Science,4,58
2,1627,Yelm Extension School,Teaches Computer Science,2,3
3,1628,Dishman Hills High School,Teaches Computer Science,6,158
4,1640,Puyallup Online Academy/POA,Teaches Computer Science,2,6


## Part 12: Listing High Schools we do not have statistics on

As mentioned earlier, we do not have course information on all schools. It is important for us to also include the schools we do not have information on, so that it is easy to identify which Local Education Agencies are not collecting and providing information on their schools.


In [75]:
# Making a copy of the high_schools dataframe
unknown_high_schools = high_schools.copy()

# We are only retaining the high schools on which we have no information
unknown_high_schools = unknown_high_schools[~unknown_high_schools.SchoolCode.isin(known_high_school_stats.SchoolCode)]

# We are only keeping the SchoolCode and SchoolName columns, as only those are needed
# for the overall Washington State High School statistics dataframe we are making at the moment.
unknown_high_schools = unknown_high_schools[["SchoolCode", "SchoolName"]]

# We are adding the total_cs_courses column and setting the value as NaN,
# as we do not know how many computer science courses the school teaches
unknown_high_schools["total_cs_courses"] = np.nan

# We are adding the yearly_enrolled_in_cs column and setting the value as NaN,
# as we do not know how many computer science enrollments the school had
unknown_high_schools["yearly_enrolled_in_cs"] = np.nan

# We are adding the school_teaches_cs column and setting the value to the label
# "No Information Available" (rather than NaN) so that these schools can be
# identified as their own category
unknown_high_schools["school_teaches_cs"] = "No Information Available"

# Resetting Index
unknown_high_schools = unknown_high_schools.reset_index(drop=True)

# Printing the head
unknown_high_schools.head()

289


Unnamed: 0,SchoolCode,SchoolName,total_cs_courses,yearly_enrolled_in_cs,school_teaches_cs
0,3204,Rosalia Elementary & Secondary School,,,No Information Available
1,4040,West Valley Jr High,,,No Information Available
2,1910,Marysville SD Special,,,No Information Available
3,1904,Parent Partnership,,,No Information Available
4,1932,Columbia Virtual Academy,,,No Information Available
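
As a cross-check on the step above, the same "high schools without statistics" list can be produced a second way and compared. The sketch below uses made-up school codes and names; the `indicator=True` merge is an alternative technique shown for comparison, not the method used in this notebook.

```python
import pandas as pd

# Toy stand-ins for the notebook's dataframes (all values are illustrative only)
high_schools = pd.DataFrame({
    "SchoolCode": [1001, 1003, 1005, 1006],
    "SchoolName": ["A High", "B High", "C High", "D High"],
})
known_high_school_stats = pd.DataFrame({"SchoolCode": [1001, 1003]})

# Approach used in this notebook: negate the isin mask with ~
via_isin = high_schools[
    ~high_schools["SchoolCode"].isin(known_high_school_stats["SchoolCode"])
]

# Alternative: a left merge with indicator=True adds a "_merge" column;
# rows marked "left_only" exist in high_schools but not in the statistics
merged = high_schools.merge(
    known_high_school_stats[["SchoolCode"]],
    on="SchoolCode",
    how="left",
    indicator=True,
)
via_merge = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Both approaches should identify the same schools
unknown_codes = sorted(via_isin["SchoolCode"])
print(unknown_codes)
```

When `SchoolCode` is unique in `high_schools`, the number of known and unknown high schools should add up to `len(high_schools)`, which is another quick sanity check.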


## Part 13: Combining High Schools we have statistics on with High Schools we do not have statistics on

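**We are now combining the dataframe of high schools we have statistics on with the dataframe of high schools we do not have statistics on. The result is the complete dataset listing all the high schools. Because the schools without statistics carry NaN values, pandas stores the two count columns as decimals in the combined dataframe (for example 1.0 instead of 1).**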

In [78]:
all_high_school_stats = pd.concat([known_high_school_stats, unknown_high_schools], sort=False)
all_high_school_stats = all_high_school_stats.reset_index(drop=True)
all_high_school_stats.head()

Unnamed: 0,SchoolCode,SchoolName,school_teaches_cs,total_cs_courses,yearly_enrolled_in_cs
0,1519,Edmonds eLearning Academy,Teaches Computer Science,1.0,1.0
1,1547,Middle College High School,Teaches Computer Science,4.0,58.0
2,1627,Yelm Extension School,Teaches Computer Science,2.0,3.0
3,1628,Dishman Hills High School,Teaches Computer Science,6.0,158.0
4,1640,Puyallup Online Academy/POA,Teaches Computer Science,2.0,6.0
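
The counts in the output above appear as `1.0` and `58.0` because a regular integer column cannot hold NaN, so pandas converts it to floats during the concat. If whole numbers are preferred in the final file, one option is pandas' nullable `Int64` type. This is a sketch on made-up values and an optional extra step, not part of the original process.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: one school with statistics, one without (illustrative values)
known = pd.DataFrame({"SchoolCode": [1001], "total_cs_courses": [4]})
unknown = pd.DataFrame({"SchoolCode": [1005], "total_cs_courses": [np.nan]})

combined = pd.concat([known, unknown], ignore_index=True)

# The integer column was upcast to float so that it can hold NaN
dtype_before = combined["total_cs_courses"].dtype
print(dtype_before)

# Optional: the nullable integer type keeps whole numbers and marks gaps as <NA>
combined["total_cs_courses"] = combined["total_cs_courses"].astype("Int64")
print(combined)
```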


## Part 14: Combining High School Statistics with School Information

**1. In the cell below, we merge the `all_high_school_stats` dataframe with the `high_schools` dataframe on the `SchoolCode` column. This adds key metadata for each school such as the Principal Name, Email ID and City. Because both dataframes contain a `SchoolName` column, pandas keeps both and names them `SchoolName_x` and `SchoolName_y` in the merged dataframe.**

In [87]:
all_high_school_stats_and_info = pd.merge(all_high_school_stats, high_schools, how = 'outer', on = 'SchoolCode')

**2. The values in the School Name, Principal Name, Email and City columns are not consistent in casing: some are in upper case while others are in Title Case. We want the data in each column to be in a consistent format, so we convert the names and cities to Title Case and the email addresses to lower case.**

In [88]:
all_high_school_stats_and_info["SchoolName_x"] = all_high_school_stats_and_info["SchoolName_x"].str.title()
all_high_school_stats_and_info["SchoolName_y"] = all_high_school_stats_and_info["SchoolName_y"].str.title()
all_high_school_stats_and_info["PrincipalName"] = all_high_school_stats_and_info["PrincipalName"].str.title()
all_high_school_stats_and_info["Email"] = all_high_school_stats_and_info["Email"].str.lower()
all_high_school_stats_and_info["City"] = all_high_school_stats_and_info["City"].str.title()

print(len(all_high_school_stats_and_info))
#Printing the Head
all_high_school_stats_and_info.head()

660


Unnamed: 0,SchoolCode,SchoolName_x,school_teaches_cs,total_cs_courses,yearly_enrolled_in_cs,LEACode,LEAName,SchoolName_y,LowestGrade,HighestGrade,PrincipalName,Email,Phone,OrgCategoryList,GradeCategory,City
0,1519,Edmonds Elearning Academy,Teaches Computer Science,1.0,1.0,31015,Edmonds School District,Edmonds Elearning Academy,9,12,Katie Bjornstad,bjornstadk@edmonds.wednet.edu,425.431.1528,"Alternative School, Public School",High School,Lynnwood
1,1547,Middle College High School,Teaches Computer Science,4.0,58.0,17001,Seattle Public Schools,Middle College High School,9,12,Elizabeth Mcfarland,emmcfarland@seattleschools.org,206.252.9905,"Public School, Regular School",High School,Seattle
2,1627,Yelm Extension School,Teaches Computer Science,2.0,3.0,34002,Yelm School District,Yelm Extension School,9,12,Ryan Akiyama,ryan_akiyama@ycs.wednet.edu,360.458.7777,"Alternative School, Public School",High School,Yelm
3,1628,Dishman Hills High School,Teaches Computer Science,6.0,158.0,32363,West Valley School District (Spokane),Dishman Hills High School,9,12,Lauren House,lauren.house@wvsd.org,509.927.1100,"Alternative School, Public School",High School,Spokane
4,1640,Puyallup Online Academy/Poa,Teaches Computer Science,2.0,6.0,27003,Puyallup School District,Puyallup Online Academy/Poa,K,12,Adriana Julian,juliaac@puyallup.k12.wa.us,253.841.8630,"Alternative School, Public School",K-12,Puyallup
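
The merge and casing steps above can be sketched on a small example. The data below is made up, and two things in the sketch are optional variations rather than what this notebook does: dropping the second `SchoolName` column before merging (which avoids the `_x`/`_y` suffixes), and passing `validate="one_to_one"` so that pandas raises an error if a `SchoolCode` appears more than once on either side.

```python
import pandas as pd

# Toy stand-ins for all_high_school_stats and high_schools (illustrative values)
stats = pd.DataFrame({
    "SchoolCode": [1001, 1005],
    "SchoolName": ["A HIGH SCHOOL", "C High School"],
    "total_cs_courses": [4.0, None],
})
info = pd.DataFrame({
    "SchoolCode": [1001, 1005],
    "SchoolName": ["A High School", "C High School"],
    "City": ["SEATTLE", "Yelm"],
})

# Drop the duplicate name column first so the result has a single SchoolName
merged = pd.merge(
    stats,
    info.drop(columns=["SchoolName"]),
    how="outer",
    on="SchoolCode",
    validate="one_to_one",  # raises MergeError if SchoolCode is duplicated
)

# Make the casing consistent within each text column
merged["SchoolName"] = merged["SchoolName"].str.title()
merged["City"] = merged["City"].str.title()
print(merged)
```

Note that `.str.title()` also lowercases the inside of acronyms and some surnames, which is why the output above shows "Puyallup Online Academy/Poa" and "Elizabeth Mcfarland". Such values may need a manual fix if exact spelling matters in the visualization.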


## Part 15: Listing Schools in Washington for which we have Geo-Data

**To map these schools, we need to add key geo-spatial data such as latitude and longitude. We have sourced this geo-data for K-12 schools from the [Washington State Data Gov Website](https://geo.wa.gov/datasets/OSPI::k-12-schools). The table below describes each column.**

| **Column Name** | **Column Description** |
| --- | --- |
| X | Longitude of the School Location |
| Y | Latitude of the School Location |
| FID | Unique ID of the School in this dataset |
| SchoolCode | Washington State School Code of this School |
| Latitude | Latitude of School |
| Longitude | Longitude of School |
| ESDCode | Education Service District Code of the School |
| ESDName | Education Service District Name of the School |
| LEACode | Local Education Agency Code for the School |
| LEAName | Local Education Agency Name for the School |
| SchoolName | Name of the School |
| LowestGrad | Lowest Grade of the School |
| HighestGra | Highest Grade of the School |
| AddressLin | Address Line 1 of the School |
| AddressL_1 | Address Line 2 of the School (optional) |
| City | City of the School |
| State | State of the School |
| ZipCode | ZipCode of the School |
| PrincipalN | Principal Name of the School |
| Email | Email ID of the Principal |
| Phone | Phone Number of the School |
| OrgCategor | Type of School by Organization Type e.g. Public School, Re-Engagement School |
| AYPCode | Adequate Yearly Progress Code |
| GradeCateg | Grade Category of the School |
| OrgCateg_1 | Organization Category of the School |

In [98]:
wa_school_geo_data = pd.read_csv("data/labelled_data/School_Data/WA_K12_Schools_Geo_Data.csv")
print(wa_school_geo_data.columns)
wa_school_geo_data.transpose()

Index(['X', 'Y', 'FID', 'SchoolCode', 'Latitude', 'Longitude', 'ESDCode',
       'ESDName', 'LEACode', 'LEAName', 'SchoolName', 'LowestGrad',
       'HighestGra', 'AddressLin', 'AddressL_1', 'City', 'State', 'ZipCode',
       'PrincipalN', 'Email', 'Phone', 'OrgCategor', 'AYPCode', 'GradeCateg',
       'OrgCateg_1'],
      dtype='object')


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2443,2444,2445,2446,2447,2448,2449,2450,2451,2452
X,-119.196,-122.355,-122.461,-117.559,-122.917,-122.446,-122.336,-119.278,-122.341,-122.341,...,-117.29,-119.104,-119.907,-119.609,-122.358,-123.299,-119.305,-122.623,-122.177,-120.603
Y,46.2244,47.2118,45.5932,47.809,46.9946,47.2558,48.4182,46.2998,47.4964,47.4964,...,47.6977,46.2385,48.0519,45.9386,47.5626,46.575,47.1062,47.3869,47.8789,47.5681
FID,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,...,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000
SchoolCode,4007,5549,5534,5417,5305,1816,3363,3469,5277,5254,...,3194,2917,2396,2133,1620,2858,2832,2681,4530,2760
Latitude,46.2244,47.2119,45.5932,47.809,46.9946,47.2558,48.4183,46.2998,47.4964,47.4964,...,47.6977,46.2386,48.0519,45.9386,47.5626,46.575,47.1062,47.3869,47.8789,47.5681
Longitude,-119.196,-122.355,-122.461,-117.559,-122.917,-122.446,-122.336,-119.278,-122.341,-122.341,...,-117.29,-119.104,-119.907,-119.609,-122.358,-123.299,-119.305,-122.623,-122.177,-120.603
ESDCode,11801,OSPI,06801,32801,OSPI,17801,OSPI,11801,17801,17801,...,32801,11801,04801,11801,17801,34801,04801,17801,29801,04801
ESDName,Educational Service District 123,Office of Superintendent of Public Instruction,Educational Service District 112,Educational Service District 101,Office of Superintendent of Public Instruction,Puget Sound Educational Service District 121,Office of Superintendent of Public Instruction,Educational Service District 123,Puget Sound Educational Service District 121,Puget Sound Educational Service District 121,...,Educational Service District 101,Educational Service District 123,North Central Educational Service District 171,Educational Service District 123,Puget Sound Educational Service District 121,Capital Region ESD 113,North Central Educational Service District 171,Puget Sound Educational Service District 121,Northwest Educational Service District 189,North Central Educational Service District 171
LEACode,3017,27901,6117,32325,34801,27010,29801,3400,17401,17401,...,32363,11001,24122,3050,17001,21301,13161,27401,31002,4228
LEAName,Kennewick School District,Chief Leschi Tribal Compact,Camas School District,Nine Mile Falls School District,Capital Region ESD 113,Tacoma School District,Northwest Educational Service District 189,Richland School District,Highline School District,Highline School District,...,West Valley School District (Spokane),Pasco School District,Pateros School District,Paterson School District,Seattle Public Schools,Pe Ell School District,Moses Lake School District,Peninsula School District,Everett School District,Cascade School District


In [102]:
all_high_school_stats_and_geo_data = pd.merge(all_high_school_stats, wa_school_geo_data, how = 'left', on = 'SchoolCode')
all_high_school_stats_and_geo_data.columns

Index(['SchoolCode', 'SchoolName_x', 'school_teaches_cs', 'total_cs_courses',
       'yearly_enrolled_in_cs', 'X', 'Y', 'FID', 'Latitude', 'Longitude',
       'ESDCode', 'ESDName', 'LEACode', 'LEAName', 'SchoolName_y',
       'LowestGrad', 'HighestGra', 'AddressLin', 'AddressL_1', 'City', 'State',
       'ZipCode', 'PrincipalN', 'Email', 'Phone', 'OrgCategor', 'AYPCode',
       'GradeCateg', 'OrgCateg_1'],
      dtype='object')
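
Because this is a left merge, any high school whose `SchoolCode` is missing from the geo-data file stays in the result with empty `Latitude` and `Longitude` values, and such a school cannot be placed on the map. The sketch below, using made-up codes and coordinates, shows one way to list those schools so they can be reviewed.

```python
import pandas as pd

# Toy stand-ins for all_high_school_stats and wa_school_geo_data (illustrative values)
stats = pd.DataFrame({"SchoolCode": [1001, 1005, 1006]})
geo = pd.DataFrame({
    "SchoolCode": [1001, 1005],
    "Latitude": [47.6, 46.9],
    "Longitude": [-122.3, -122.6],
})

# Left merge keeps every school from stats, matched or not
merged = pd.merge(stats, geo, how="left", on="SchoolCode")

# Schools with no coordinates after the merge
missing_geo = merged[merged["Latitude"].isna() | merged["Longitude"].isna()]
print(missing_geo["SchoolCode"].tolist())
```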

## Part 16: Merging Geo-Data with the School Statistics and Information

In [None]:
wa_high_school_stats_info_with_geo = pd.merge(all_high_school_stats_and_info,wa_school_geo_data, how = 'left', on = 'SchoolCode')
#wa_high_school_stats_info_with_geo = wa_high_school_stats_info_with_geo.drop(columns=['SchoolName_x'])

wa_high_school_stats_info_with_geo.columns

In [None]:
# Optional: rename the statistics columns (this step is currently disabled)
# wa_high_school_stats_info_with_geo = wa_high_school_stats_info_with_geo.rename(columns = {
#     'SchoolCode': 'School_Code',
#     'total_cs_courses': 'Total_CS_Courses',
#     'yearly_enrolled_in_cs': 'Yearly_Enrolled_In_CS',
#     'school_teaches_cs': 'School_Teaches_CS'})


In [None]:
wa_high_school_stats_info_with_geo.to_csv("all_data_latest.csv")
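
By default, `to_csv` also writes the dataframe index as an extra first column with no name. If that column is not wanted when the file is loaded into Power BI, passing `index=False` leaves it out. The sketch below demonstrates this on a made-up dataframe, writing to an in-memory buffer instead of a file so that nothing is saved to disk.

```python
import io

import pandas as pd

# Toy dataframe standing in for the final dataset (illustrative values)
df = pd.DataFrame({"SchoolCode": [1001], "Latitude": [47.6]})

# Write without the index column, then read the result back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(round_trip.columns.tolist())
```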