# **Python Basics for City Data**: An Introductory Tutorial Using Chicago's Municipal Datasets
# Tutorial Part I

***

### **Step 1: Import Necessary Libraries**

Although libraries only need to be installed once, we need to import them in every new Jupyter Notebook or Python script. 


In [None]:
import pandas as pd                   #Pandas: Provides data structures and data analysis tools (e.g., DataFrame for tabular data).
import numpy as np                    #NumPy: Provides support for arrays, matrices, and high-level mathematical functions to operate on these data structures.
import seaborn as sns                 #Seaborn: Based on Matplotlib, it provides a high-level interface for drawing attractive and informative statistical graphics.
from matplotlib import pyplot as plt  #Matplotlib's pyplot: Used for creating static, interactive, and animated visualizations in Python.

***

### **Step 2: Load and Inspect the Data**

####  2.1

- We will load the dataset `Chicago_Public_Schools_-_School_Progress_Reports_SY2223_20240122.csv` into a Pandas DataFrame using the `pd.read_csv()` function from the Pandas library. 

- Specify the file path relative to this notebook, as in the cell below: `'../Data/Chicago_Public_Schools_-_School_Progress_Reports_SY2223_20240122.csv'`. 

- We will assign the resulting DataFrame to the variable `cps_data`. This will enable us to work with the CSV file data as a DataFrame in Pandas.
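
Before loading the real file, it may help to see `pd.read_csv()` on a tiny stand-in. The sketch below is only an illustration: the three rows and the `toy_csv` and `toy_data` names are made up, and an in-memory text buffer takes the place of a file on disk.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for a real file on disk.
toy_csv = io.StringIO(
    "School_ID,Short_Name,Zip\n"
    "1,ALPHA HS,60601\n"
    "2,BETA ES,60602\n"
    "3,GAMMA MS,60603\n"
)

# pd.read_csv accepts a file path or any file-like object.
toy_data = pd.read_csv(toy_csv)

print(toy_data.shape)  # (number of rows, number of columns)
```

The first line of the CSV becomes the column names, and each later line becomes one row of the DataFrame.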

In [None]:
cps_data = pd.read_csv('../Data/<name of .CSV file>') #add the name of the .CSV file 

#### 2.2 

To understand the structure of the data, we'll look at the first few rows by calling the `.head()` method on the DataFrame (`cps_data`).
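
As a quick illustration of `.head()`, here is a sketch on a made-up ten-row DataFrame (the `toy_data` name and its values are hypothetical and unrelated to the CPS file).

```python
import pandas as pd

# Hypothetical ten-row DataFrame; the values are made up.
toy_data = pd.DataFrame({
    "school": [f"School {i}" for i in range(10)],
    "score": range(10),
})

first_five = toy_data.head()    # with no argument, returns the first 5 rows
first_three = toy_data.head(3)  # pass a number to change how many rows you get

print(first_three)
```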

In [None]:
<dataframe name>.head() 

#### 2.3 

The dataset is very big, but we only want to use some of the data. 

- We can take a look at all of the columns by using `.columns` after the name of the dataframe.

- Adding `.tolist()` to the end of `.columns` converts the result to a plain Python list, which is easier to read<sup>*</sup>

###### * Click "scrollable element" in VS Code to scroll through the entire list.
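The difference between `.columns` and `.columns.tolist()` is easiest to see side by side. This sketch uses a hypothetical three-column DataFrame named `toy_data`.

```python
import pandas as pd

# Hypothetical one-row DataFrame with three columns.
toy_data = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

column_index = toy_data.columns          # a pandas Index object
column_list = toy_data.columns.tolist()  # a plain Python list of names

print(column_index)
print(column_list)
```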

In [None]:
<dataframe name>.columns. #add function at the end that makes the array a list 

*** 

### **Step 3: Data Cleaning and Preparation**

Before analyzing the data, let's ensure it is clean and formatted correctly.

#### 3.1 

In this notebook, we only want to work with the following columns:

- `School_ID`: Unique identifier for each school.

- `Short_Name`: Abbreviated name of the school, used for easier identification.

- `Zip`: Postal zip code where the school is located.

- `Primary_Category`: Type of school, typically categorized as High School (HS), Middle School (MS), Elementary School (ES).

- `Attainment_All_Grades_School_Pct`: The percentile ranking of the school's overall performance across all grades compared to the national average.

- `Attainment_PSAT_Grade_9_School_Pct`: The percentile ranking of the school's performance for 9th grade students taking the PSAT, compared to the national average.

- `Attainment_PSAT_Grade_10_School_Pct`: The percentile ranking of the school's performance for 10th grade students taking the PSAT, compared to the national average.

- `School_Survey_Involved_Families`: Survey results indicating the level of family involvement and engagement within the school community.

- `School_Survey_Safety`: Survey results indicating the perceived safety within the school.

- `School_Survey_Supportive_Environment`: Survey results indicating the perceived supportiveness of the school environment.

We can select these columns from the `cps_data` dataframe by passing a list of column names inside the indexing brackets, then assign the result back to the same variable name.
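
Selecting several columns uses double square brackets: the outer pair indexes the DataFrame and the inner pair is a Python list of names. A minimal sketch on hypothetical data (`toy_data`, `id`, `name`, and `extra` are made-up names):

```python
import pandas as pd

# Hypothetical DataFrame with one column we do not need.
toy_data = pd.DataFrame({
    "id": [1, 2],
    "name": ["A", "B"],
    "extra": [0.1, 0.2],
})

# Outer brackets index the DataFrame; inner brackets form a list of column names.
subset = toy_data[["id", "name"]]

print(subset.columns.tolist())
```

The columns come back in the order given in the list, so this is also a way to reorder columns.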


In [None]:
cps_data = cps_data[['School_ID','Short_Name','Zip',...]] #add the additional columns... 

cps_data.head()

#### 3.2

For this dataset, we want to work only with high schools. We can filter the dataset by "dropping" every school that has `ES` or `MS` in the `Primary_Category` column.
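
The drop pattern works in two steps: a boolean comparison picks out the unwanted rows, and the `.index` of those rows is handed to `.drop()`. Here is a sketch on hypothetical data (`toy_data`, `name`, and `category` are made-up names):

```python
import pandas as pd

# Hypothetical data: two high schools and two elementary schools.
toy_data = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma", "Delta"],
    "category": ["HS", "ES", "HS", "ES"],
})

# Step 1: find the index labels of the rows we want to remove.
rows_to_drop = toy_data[toy_data["category"] == "ES"].index

# Step 2: drop those rows, then renumber the remaining rows from 0.
filtered = toy_data.drop(rows_to_drop).reset_index(drop=True)

print(filtered)
```

An equivalent approach is to keep the rows you want instead, for example `toy_data[toy_data["category"] != "ES"]`.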

In [None]:
# Drop elementary schools ('ES')
cps_data = cps_data.drop(cps_data[cps_data['Primary_Category'] == 'ES'].index)

# Drop middle schools ('MS')
cps_data =  # drop the rows with 'MS' in the 'Primary_Category' column

cps_data = cps_data.reset_index(drop=True)  # build a fresh 0-based index and discard the old one
cps_data = cps_data.dropna()  # drop rows that contain any NA values


cps_data.head()

#### 3.3

Next, we will convert school survey responses to an ordinal scale for analysis. 
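
To show the idea without giving away the survey values, the sketch below maps a different, made-up set of text labels to numbers. The labels, `sizes`, and `size_map` are all hypothetical. Note that any label missing from the dictionary becomes `NaN`.

```python
import pandas as pd

# Hypothetical text labels (not the survey responses from the dataset).
sizes = pd.Series(["SMALL", "LARGE", "MEDIUM", "SMALL", "UNKNOWN"])

# .unique() lists each distinct value once, so we know what the map must cover.
print(sizes.unique())

# Dictionary from label to its position on an ordinal scale.
size_map = {"SMALL": 1, "MEDIUM": 2, "LARGE": 3}

# .map() looks up every value in the dictionary; unmapped labels become NaN.
size_scores = sizes.map(size_map)

print(size_scores)
```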

In [None]:
# List every distinct survey response that will need to be mapped to the scale

cps_data.<column>.unique() # use any survey-related column 

In [None]:
# Convert school survey responses to an ordinal scale for analysis
survey_map = {'VERY STRONG': 5, 'STRONG': 4,...} #complete the map with the other response values from the survey 
cps_data['School_Survey_Involved_Families'] = <dataframe>['School_Survey_Involved_Families'].map(<variable_for_survey_map>)
cps_data['School_Survey_Safety'] = 
cps_data['School_Survey_Supportive_Environment'] =

cps_data.head()

#### 3.4 

Next, we'll perform a data cleaning operation on the `cps_data` DataFrame, targeting the columns in the `attainment_columns` list, which holds the columns related to school attainment percentages. 

We will fill the missing values in these columns with the mean of each column. 

* `.apply(lambda x: ...)`: The `apply` method applies a function along an axis of the DataFrame (by default, column by column). Here, it applies a lambda function to each column selected by `cps_data[attainment_columns]`.

* `lambda x: x.fillna(x.mean())`: This lambda function takes each column `x` (one of the attainment metrics) and replaces its missing values (`NaN`) with the mean of that column. `fillna` replaces the missing values, and `x.mean()` computes the mean of the column, ignoring `NaN` values.
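
Here is how the `apply` and `fillna` combination behaves on a small, made-up example (`toy_data`, `score_a`, and `score_b` are hypothetical names). Each column is filled with its own mean, not a shared one.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value in each column.
toy_data = pd.DataFrame({
    "score_a": [10.0, np.nan, 30.0],  # mean of the known values is 20.0
    "score_b": [np.nan, 4.0, 6.0],    # mean of the known values is 5.0
})
score_columns = ["score_a", "score_b"]

# apply() passes one column at a time to the lambda as x.
toy_data[score_columns] = toy_data[score_columns].apply(
    lambda x: x.fillna(x.mean())
)

print(toy_data)
```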

In [None]:
# Designate the attainment columns. 
attainment_columns = ['Attainment_All_Grades_School_Pct',....] #fill in with all attainment-related columns 

#Fill in missing values with the mean of each column.
cps_data[attainment_columns] = cps_data[attainment_columns].apply(lambda x: <function to fill NaN values with the mean of that column>)

cps_data[attainment_columns].head()

***

### **Step 4: Data Analysis**
Let's perform some basic data analysis to understand the performance and perceptions in different schools.

#### 4.1 School Performance by Zip Code
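Grouping splits the rows by the values in one column and then summarizes each group. The sketch below shows the pattern on made-up data; `region`, `score_a`, and `score_b` are hypothetical names, and the bar chart is only one possible plot type.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical data: two regions with two schools each.
toy_data = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "score_a": [10, 30, 50, 70],
    "score_b": [1, 3, 5, 7],
})

# One row per region, holding the mean of each selected column.
mean_by_region = toy_data.groupby("region")[["score_a", "score_b"]].mean()
print(mean_by_region)

# DataFrame.plot returns a Matplotlib Axes that we can label further.
ax = mean_by_region.plot(kind="bar", figsize=(6, 4), title="Mean score by region (toy data)")
ax.set_ylabel("Mean score")
plt.show()
```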

In [None]:
# Calculate average attainment for all grades by zip code
average_attainment_by_zip = cps_data.groupby(<column you want to group by>)[attainment_columns].mean()

# PLOTTING
# <plot_type>: Options include 'line', 'scatter', 'bar', 'box', 'hist', and 'pie'. Choose whichever fits the data BEST.
# <x>, <y>: Choose the width and height of the figure (in inches).
# <Figure Title>: Choose a title for the figure.
average_attainment_by_zip.plot(kind=<plot_type>, figsize=(<x>, <y>), title=<Figure Title>)


plt.ylabel('<yTitle>') #<yTitle>: Choose label for Y axis 
plt.show()

#### 4.2 Correlation Between School Surveys and Performance

Let's examine if there's any correlation between the school surveys and performance.
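
To get a feel for what `.corr()` returns, consider this sketch on made-up numbers (`x`, `y`, and `z` are hypothetical columns). By default pandas computes the Pearson correlation, which ranges from -1 (perfectly opposite) to 1 (perfectly aligned).

```python
import pandas as pd

# Hypothetical columns: y rises with x, z falls as x rises.
toy_data = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [4, 3, 2, 1],
})

# Square table with one row and one column per input column.
toy_corr = toy_data.corr()

print(toy_corr.round(2))
```

Every column correlates perfectly with itself, so the diagonal of the matrix is always 1.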

In [None]:
# 1. Prepare a subset of relevant columns
subset_data = cps_data[[<Relevant Columns>]]

# 2. Calculate correlation matrix
correlation_matrix = subset_data.corr()

# 3. Plot the correlation matrix
plt.figure___ #fill in
plt.title___ #fill in 
# cmap options include: viridis, plasma, inferno, magma, cividis, coolwarm, flag, ocean, and more! 
sns.heatmap(correlation_matrix, annot=True, cmap=<cmap>, fmt=".2f") # fmt=".2f" shows each annotated value with two decimal places 
plt.show()

#### 4.3 School Performance Distribution

Finally, let's visualize the distribution of school performance across all grades with a histogram.
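
A histogram sorts values into equal-width bins and draws one bar per bin. The sketch below uses ten made-up scores; the `scores` name, the bin count, and the labels are illustrative choices, not recommendations for the real data.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical scores for ten schools.
scores = pd.Series([55, 60, 62, 70, 71, 73, 80, 85, 90, 95])

# bins controls how many equal-width intervals the range is split into.
ax = scores.plot(kind="hist", bins=5, figsize=(6, 4), title="Score distribution (toy data)")
ax.set_xlabel("Score")
ax.set_ylabel("Number of schools")
plt.show()
```

Trying a few different `bins` values is a quick way to see how the choice changes the shape of the plot.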

In [None]:
# Plotting distribution of school performance (the attainment) across all grades
cps_data[<column_name>].plot(kind=<plot_type>, bins=<bin_number>, figsize=(<x>, <y>), title=<Title>)
plt.xlabel___
plt.ylabel___
plt.show()


*** 
### Conclusion

This tutorial walked you through the basic steps of loading, cleaning, and analyzing the Chicago Public Schools performance data. Through this analysis, we can gain insights into how schools perform across different zip codes and how factors such as survey results correlate with school performance. With a few extra lines of code, we can also place these plots as subplots in **one** figure. Feel free to explore further and pose new questions as you dive deeper into the data.
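
As one possible way to combine plots into a single figure, the sketch below uses `plt.subplots()` with made-up data. The `scores` and `mean_by_region` names and their values are hypothetical; the same pattern should carry over to the plots built in Step 4.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical data for two small plots.
scores = pd.Series([55, 60, 62, 70, 71, 73, 80, 85, 90, 95])
mean_by_region = pd.Series({"north": 20.0, "south": 60.0})

# One figure holding a 1-by-2 grid of axes.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# The ax argument tells pandas which subplot to draw on.
scores.plot(kind="hist", bins=5, ax=axes[0], title="Score distribution")
mean_by_region.plot(kind="bar", ax=axes[1], title="Mean score by region")

fig.tight_layout()  # keep titles and labels from overlapping
plt.show()
```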