# <center>________________________________________________________________</center>

# <center>ANALYSIS OF CHICAGO PUBLIC DATASETS</center>

# <center>________________________________________________________________</center>

## Datasets
***

### 1. Socioeconomic Indicators in Chicago (2008-2012)

The city of Chicago released a dataset of socioeconomic data to the Chicago City Portal.
This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” for each Chicago community area, for the years 2008 – 2012.

Scores on the hardship index can range from 1 to 100, with a higher index number representing a greater level of hardship.

A detailed description of the dataset can be found on [the city of Chicago's website](https://data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2), but to summarize, the dataset has the following variables:

*   **Community Area Number** (`ca`): Used to uniquely identify each row of the dataset

*   **Community Area Name** (`community_area_name`): The name of the region in the city of Chicago

*   **Percent of Housing Crowded** (`percent_of_housing_crowded`): Percent of occupied housing units with more than one person per room

*   **Percent Households Below Poverty** (`percent_households_below_poverty`): Percent of households living below the federal poverty line

*   **Percent Aged 16+ Unemployed** (`percent_aged_16_unemployed`): Percent of persons over the age of 16 years that are unemployed

*   **Percent Aged 25+ without High School Diploma** (`percent_aged_25_without_high_school_diploma`): Percent of persons over the age of 25 years without a high school education

*   **Percent Aged Under 18 or Over 64** (`percent_aged_under_18_or_over_64`): Percent of population under 18 or over 64 years of age (i.e., dependents)

*   **Per Capita Income** (`per_capita_income_`): Community Area per capita income is estimated as the sum of tract-level aggregate incomes divided by the total population

*   **Hardship Index** (`hardship_index`): Score that incorporates each of the six selected socioeconomic indicators
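
As a worked illustration of the per capita income definition above, the following sketch aggregates tract-level figures for one hypothetical community area. The tract numbers are invented for the example and do not come from the dataset.

```python
# Hypothetical census tracts in a single community area:
# (aggregate income in dollars, population). All figures are invented.
tracts = [
    (50_000_000, 2_000),
    (30_000_000, 1_500),
    (20_000_000, 1_500),
]

def per_capita_income(tracts):
    """Sum of tract-level aggregate incomes divided by the total population."""
    total_income = sum(income for income, _ in tracts)
    total_population = sum(population for _, population in tracts)
    return total_income / total_population

print(per_capita_income(tracts))  # 100,000,000 / 5,000 = 20000.0
```

Note that this is a population-weighted figure, so it generally differs from a plain average of the per-tract values.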

### 2. Chicago Public Schools - Progress Report Cards (2011-2012)

The city of Chicago released a dataset showing all school level performance data used to create School Report Cards for the 2011-2012 school year. The dataset is available from the Chicago Data Portal: [https://data.cityofchicago.org/Education/Chicago-Public-Schools-Progress-Report-Cards-2011-/9xs2-f89t](https://data.cityofchicago.org/Education/Chicago-Public-Schools-Progress-Report-Cards-2011-/9xs2-f89t)

This dataset includes a large number of metrics. The glossary can be found [here](https://data.cityofchicago.org/api/assets/AAD41A13-BE8A-4E67-B1F5-86E711E09D5F?download=true).

### 3. Chicago Crime Data (2001-Present)

This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days.

A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at:
[https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2)

### Download the datasets

We will download the datasets from the links below, instead of directly from the Chicago Data Portal. The versions linked here are subsets of the original datasets and have some of the column names modified to be more database friendly. This way we can focus precisely on SQL queries instead of data wrangling.

*   <a href="https://github.com/efeyemez/Portfolio/blob/main/Datasets/Chicago_Public_Datasets/ChicagoCensusData.csv" target="_blank">Chicago Census Data</a>

*   <a href="https://github.com/efeyemez/Portfolio/blob/main/Datasets/Chicago_Public_Datasets/ChicagoPublicSchools.csv" target="_blank">Chicago Public Schools</a>

*   <a href="https://github.com/efeyemez/Portfolio/blob/main/Datasets/Chicago_Public_Datasets/ChicagoCrimeData.csv" target="_blank">Chicago Crime Data</a>
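
To give a sense of what "database friendly" column names can mean, here is a minimal sketch of one plausible convention: upper-case the label and replace every non-alphanumeric character with an underscore. The helper `to_db_friendly` is hypothetical, and this is not necessarily how the linked files were prepared.

```python
import re

def to_db_friendly(name: str) -> str:
    """Upper-case a column label and replace each non-alphanumeric character with '_'."""
    return re.sub(r"[^0-9A-Za-z]", "_", name).upper()

# Example labels in the style of the portal's human-readable headers.
raw_columns = ["Community Area Number", "Percent Aged 16+ Unemployed", "Safety Score"]
friendly = [to_db_friendly(c) for c in raw_columns]
print(friendly)
```

Under this convention `Percent Aged 16+ Unemployed` becomes `PERCENT_AGED_16__UNEMPLOYED`, because both the `+` and the following space turn into underscores. That happens to match the double underscore in the column name queried later in this notebook.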

## Libraries
***

In [None]:
#!pip install pandas
#!pip install sqlalchemy
# sqlite3 ships with the Python standard library, so it needs no pip install
#!pip install ipython-sql
#!pip install matplotlib
#!pip install seaborn

In [None]:
import csv, sqlite3
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

## Creating and Connecting to Database
***

In [None]:
%load_ext sql

In [None]:
con = sqlite3.connect("chicago.db")
cur = con.cursor()

In [None]:
%sql sqlite:///chicago.db

## Data Acquisition
***

In [None]:
df_census = pd.read_csv('https://github.com/efeyemez/Portfolio/raw/main/Datasets/Chicago_Public_Datasets/ChicagoCensusData.csv')
print(df_census.shape)
df_census.head(3)

In [None]:
df_schools = pd.read_csv('https://github.com/efeyemez/Portfolio/raw/main/Datasets/Chicago_Public_Datasets/ChicagoPublicSchools.csv')
print(df_schools.shape)
df_schools.head(3)

In [None]:
df_crime = pd.read_csv('https://github.com/efeyemez/Portfolio/raw/main/Datasets/Chicago_Public_Datasets/ChicagoCrimeData.csv')
print(df_crime.shape)
df_crime.head(3)

## Importing Datasets into the Database
***

In [None]:
df_census.to_sql("CENSUS_DATA", con, if_exists='replace', index=False, method="multi")
df_schools.to_sql("CHICAGO_PUBLIC_SCHOOLS", con, if_exists='replace', index=False, method="multi")
df_crime.to_sql("CHICAGO_CRIME_DATA", con, if_exists='replace', index=False, method="multi")

The database system catalog:

In [None]:
%sql SELECT name FROM sqlite_master WHERE type='table';
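
The same catalog query can also be issued through a plain `sqlite3` connection, without the `%sql` magic. The sketch below uses an in-memory database with empty toy tables standing in for the three real ones, so it runs without the CSV files; the single `ID` column is a placeholder.

```python
import sqlite3

# In-memory database with toy tables named like the ones created above.
demo = sqlite3.connect(":memory:")
for table in ("CENSUS_DATA", "CHICAGO_PUBLIC_SCHOOLS", "CHICAGO_CRIME_DATA"):
    demo.execute(f"CREATE TABLE {table} (ID INTEGER)")

# sqlite_master holds one row per schema object; keep only the tables.
rows = demo.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
).fetchall()
table_names = [name for (name,) in rows]
print(table_names)
demo.close()
```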

# QUERIES FOR CENSUS DATA
***

## Column metadata

In [None]:
%sql SELECT name,type,length(type) FROM PRAGMA_TABLE_INFO('CENSUS_DATA') LIMIT 5;

## How many records are in the dataset?

In [None]:
%sql SELECT COUNT(*) FROM CENSUS_DATA;

## What is the maximum value of hardship index in the dataset?

In [None]:
%sql SELECT MAX(HARDSHIP_INDEX) FROM CENSUS_DATA;

## Which community area has the highest hardship index? (with a sub-query)

In [None]:
%%sql SELECT COMMUNITY_AREA_NAME, HARDSHIP_INDEX FROM CENSUS_DATA
        WHERE HARDSHIP_INDEX = (SELECT MAX(HARDSHIP_INDEX) FROM CENSUS_DATA)
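
One reason to prefer the sub-query form over `ORDER BY ... LIMIT 1` is that it returns every area tied for the maximum. The following sketch shows the difference on invented rows in an in-memory database; the area names and scores are made up.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE CENSUS_DATA (COMMUNITY_AREA_NAME TEXT, HARDSHIP_INDEX REAL)")
demo.executemany(
    "INSERT INTO CENSUS_DATA VALUES (?, ?)",
    [("Area A", 40.0), ("Area B", 98.0), ("Area C", 98.0), ("Area D", None)],
)

# Sub-query: every row whose index equals the maximum is returned.
with_subquery = demo.execute(
    """SELECT COMMUNITY_AREA_NAME FROM CENSUS_DATA
       WHERE HARDSHIP_INDEX = (SELECT MAX(HARDSHIP_INDEX) FROM CENSUS_DATA)
       ORDER BY COMMUNITY_AREA_NAME"""
).fetchall()

# ORDER BY + LIMIT 1: only one of the tied rows survives.
with_limit = demo.execute(
    """SELECT COMMUNITY_AREA_NAME FROM CENSUS_DATA
       ORDER BY HARDSHIP_INDEX DESC, COMMUNITY_AREA_NAME LIMIT 1"""
).fetchall()

print(with_subquery)  # both tied areas
print(with_limit)     # a single area
demo.close()
```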

## How many community areas have a hardship index greater than 50?

In [None]:
%sql SELECT COUNT(*) FROM CENSUS_DATA WHERE HARDSHIP_INDEX > 50.0;

## Which community areas have per-capita income greater than $60,000?

In [None]:
%sql SELECT COMMUNITY_AREA_NAME FROM CENSUS_DATA WHERE PER_CAPITA_INCOME > 60000;

## Which community areas have per capita income less than $11,000?

In [None]:
%%sql SELECT COMMUNITY_AREA_NAME, PER_CAPITA_INCOME FROM CENSUS_DATA
        WHERE PER_CAPITA_INCOME < 11000;

## List the 5 community areas with the highest percentage of households below the poverty line

In [None]:
%%sql SELECT COMMUNITY_AREA_NAME, PERCENT_HOUSEHOLDS_BELOW_POVERTY FROM CENSUS_DATA
        ORDER BY PERCENT_HOUSEHOLDS_BELOW_POVERTY DESC LIMIT 5;

## Scatter plot of per-capita income and hardship_index

In [None]:
income_vs_hardship = %sql SELECT PER_CAPITA_INCOME, HARDSHIP_INDEX FROM CENSUS_DATA;

fig = sns.jointplot(x='PER_CAPITA_INCOME', y='HARDSHIP_INDEX', data=income_vs_hardship.DataFrame())

fig.set_axis_labels("Per Capita Income", "Hardship Index")

# Display the plot
plt.show()

## Scatter plot of per-capita income and percent households below poverty

In [None]:
income_vs_poverty = %sql SELECT PER_CAPITA_INCOME, PERCENT_HOUSEHOLDS_BELOW_POVERTY FROM CENSUS_DATA;

fig = sns.jointplot(x='PER_CAPITA_INCOME',y='PERCENT_HOUSEHOLDS_BELOW_POVERTY', data=income_vs_poverty.DataFrame())

fig.set_axis_labels("Per Capita Income", "Percent Households Below Poverty")

# Display the plot
plt.show()

## Scatter plot of per-capita income and percent aged 16+ unemployed

In [None]:
income_vs_plus16 = %sql SELECT PER_CAPITA_INCOME, PERCENT_AGED_16__UNEMPLOYED FROM CENSUS_DATA;

fig = sns.jointplot(x='PER_CAPITA_INCOME',y='PERCENT_AGED_16__UNEMPLOYED', data=income_vs_plus16.DataFrame())

fig.set_axis_labels("Per Capita Income", "Percent Aged 16+ Unemployed")

# Display the plot
plt.show()

# QUERIES FOR PUBLIC SCHOOLS DATA
***

## Column metadata

In [None]:
%sql SELECT name,type,length(type) FROM PRAGMA_TABLE_INFO('CHICAGO_PUBLIC_SCHOOLS') LIMIT 5;

## What is the highest "Safety Score"?

In [None]:
%sql SELECT MAX(SAFETY_SCORE) AS MAX_SAFETY_SCORE FROM CHICAGO_PUBLIC_SCHOOLS;

## Which schools have the highest "Safety Score"?

In [None]:
%%sql SELECT NAME_OF_SCHOOL, SAFETY_SCORE FROM CHICAGO_PUBLIC_SCHOOLS
        WHERE SAFETY_SCORE = (SELECT MAX(SAFETY_SCORE) FROM CHICAGO_PUBLIC_SCHOOLS);

## What are the 5 schools with the lowest "Safety Score"?

In [None]:
%%sql SELECT NAME_OF_SCHOOL, SAFETY_SCORE FROM CHICAGO_PUBLIC_SCHOOLS
        ORDER BY SAFETY_SCORE NULLS LAST LIMIT 5
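
The `NULLS LAST` clause matters because SQLite treats `NULL` as smaller than any other value, so an ascending sort would otherwise list schools with a missing score first. `NULLS LAST` is available from SQLite 3.30.0 onward; the sketch below, on invented rows, shows the default behaviour and a portable equivalent for older versions.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE SCHOOLS (NAME_OF_SCHOOL TEXT, SAFETY_SCORE REAL)")
demo.executemany(
    "INSERT INTO SCHOOLS VALUES (?, ?)",
    [("School A", 45.0), ("School B", None), ("School C", 12.0)],
)

# Default ascending order: the NULL score sorts first.
default_order = [row[0] for row in demo.execute(
    "SELECT NAME_OF_SCHOOL FROM SCHOOLS ORDER BY SAFETY_SCORE"
)]

# Portable alternative to NULLS LAST: sort on the "is NULL" flag first.
nulls_last = [row[0] for row in demo.execute(
    "SELECT NAME_OF_SCHOOL FROM SCHOOLS ORDER BY SAFETY_SCORE IS NULL, SAFETY_SCORE"
)]

print(default_order)
print(nulls_last)
demo.close()
```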

## List the average safety score for each type of school.

In [None]:
%%sql SELECT School_Type, AVG(SAFETY_SCORE) AS Average_Safety_Score FROM CHICAGO_PUBLIC_SCHOOLS
        GROUP BY School_Type;

## How many elementary schools are in the dataset?

In [None]:
%sql SELECT COUNT(*) FROM CHICAGO_PUBLIC_SCHOOLS WHERE "School_Type" = 'ES';

## What are the top 10 schools with the highest "Average Student Attendance"?

In [None]:
%%sql SELECT NAME_OF_SCHOOL, AVERAGE_STUDENT_ATTENDANCE FROM CHICAGO_PUBLIC_SCHOOLS
        ORDER BY AVERAGE_STUDENT_ATTENDANCE DESC NULLS LAST LIMIT 10;

## What are the 5 schools with the lowest "Average Student Attendance"?

In [None]:
%%sql SELECT NAME_OF_SCHOOL, AVERAGE_STUDENT_ATTENDANCE FROM CHICAGO_PUBLIC_SCHOOLS
        ORDER BY AVERAGE_STUDENT_ATTENDANCE ASC NULLS LAST LIMIT 5;

## Remove the '%' sign from the above result set for "Average Student Attendance" column

In [None]:
%%sql SELECT NAME_OF_SCHOOL, REPLACE(AVERAGE_STUDENT_ATTENDANCE, '%', '') AS 'Average Student Attendance (%)' FROM CHICAGO_PUBLIC_SCHOOLS
        ORDER BY AVERAGE_STUDENT_ATTENDANCE ASC NULLS LAST LIMIT 5;

## Which schools have "Average Student Attendance" lower than 70%?

In [None]:
%%sql SELECT NAME_OF_SCHOOL, AVERAGE_STUDENT_ATTENDANCE FROM CHICAGO_PUBLIC_SCHOOLS
        WHERE CAST(REPLACE(AVERAGE_STUDENT_ATTENDANCE, '%', '') AS DOUBLE) < 70
        ORDER BY CAST(REPLACE(AVERAGE_STUDENT_ATTENDANCE, '%', '') AS DOUBLE);
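
The cast is needed because the attendance values are stored as text with a trailing `%`, and text is compared character by character. The sketch below, on invented rows, contrasts sorting the raw strings with sorting the converted numbers.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE SCHOOLS (NAME_OF_SCHOOL TEXT, AVERAGE_STUDENT_ATTENDANCE TEXT)")
demo.executemany(
    "INSERT INTO SCHOOLS VALUES (?, ?)",
    [("School A", "100.00%"), ("School B", "57.90%"), ("School C", "96.40%")],
)

# Text sort: "100.00%" comes first because '1' sorts before '5' and '9'.
text_order = [row[0] for row in demo.execute(
    "SELECT NAME_OF_SCHOOL FROM SCHOOLS ORDER BY AVERAGE_STUDENT_ATTENDANCE"
)]

# Numeric sort: strip the percent sign and cast before ordering.
numeric_order = [row[0] for row in demo.execute(
    """SELECT NAME_OF_SCHOOL FROM SCHOOLS
       ORDER BY CAST(REPLACE(AVERAGE_STUDENT_ATTENDANCE, '%', '') AS REAL)"""
)]

print(text_order)
print(numeric_order)
demo.close()
```

The two orders agree only when every value has the same number of digits before the decimal point, which is why sorting on the cast expression is the safer habit.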

## What is the total "College Enrollment" for each community area? (show 5)

In [None]:
%%sql SELECT COMMUNITY_AREA_NAME, SUM(COLLEGE_ENROLLMENT) AS TOTAL_ENROLLMENT FROM CHICAGO_PUBLIC_SCHOOLS
        GROUP BY COMMUNITY_AREA_NAME LIMIT 5;

## What are the 5 community areas with the least total "College Enrollment"? (sorted in ascending order)

In [None]:
%%sql SELECT COMMUNITY_AREA_NAME, SUM(COLLEGE_ENROLLMENT) AS TOTAL_ENROLLMENT FROM CHICAGO_PUBLIC_SCHOOLS
        GROUP BY COMMUNITY_AREA_NAME ORDER BY TOTAL_ENROLLMENT ASC LIMIT 5;

# QUERIES FOR CRIME DATA
***

## Column metadata

In [None]:
%sql SELECT name,type,length(type) FROM PRAGMA_TABLE_INFO('CHICAGO_CRIME_DATA') LIMIT 5;

## What is the total number of crimes recorded in the table?

In [None]:
%sql SELECT COUNT(*) AS TOTAL_CRIMES FROM CHICAGO_CRIME_DATA;
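
When counting records it helps to keep three variants apart: `COUNT(*)` counts rows, `COUNT(column)` skips `NULL` values, and `COUNT(DISTINCT column)` also collapses duplicates. A small sketch on an invented table (the `CRIMES` table and its rows are hypothetical):

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE CRIMES (ID INTEGER, CASE_NUMBER TEXT)")
demo.executemany(
    "INSERT INTO CRIMES VALUES (?, ?)",
    [(1, "HX1"), (2, "HX2"), (2, "HX2"), (None, "HX3")],
)

# One row comes back holding the three counts side by side.
count_rows, count_ids, count_distinct_ids = demo.execute(
    "SELECT COUNT(*), COUNT(ID), COUNT(DISTINCT ID) FROM CRIMES"
).fetchone()

print(count_rows, count_ids, count_distinct_ids)  # 4 3 2
demo.close()
```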

## What are the case numbers for crimes involving minors? (children are not considered minors for the purposes of crime analysis)

In [None]:
%%sql SELECT CASE_NUMBER, PRIMARY_TYPE, DESCRIPTION FROM CHICAGO_CRIME_DATA
        WHERE PRIMARY_TYPE LIKE '%MINOR%' OR DESCRIPTION LIKE '%MINOR%';

## What are the kidnapping crimes involving a child?

In [None]:
%%sql SELECT * FROM CHICAGO_CRIME_DATA
        WHERE PRIMARY_TYPE LIKE '%KIDNAP%' AND DESCRIPTION LIKE '%CHILD%';

## What kinds of crimes were recorded at schools?

In [None]:
%%sql SELECT DISTINCT PRIMARY_TYPE, LOCATION_DESCRIPTION FROM CHICAGO_CRIME_DATA
        WHERE LOCATION_DESCRIPTION LIKE '%SCHOOL%';
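
Keep in mind that `DISTINCT` applies to the whole selected row, not to a single column, so the same crime type can appear once per distinct school location. The following sketch on invented rows contrasts distinct pairs with distinct crime types only.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE CRIMES (PRIMARY_TYPE TEXT, LOCATION_DESCRIPTION TEXT)")
demo.executemany(
    "INSERT INTO CRIMES VALUES (?, ?)",
    [
        ("BATTERY", "SCHOOL, PUBLIC, BUILDING"),
        ("BATTERY", "SCHOOL, PUBLIC, GROUNDS"),
        ("BATTERY", "SCHOOL, PUBLIC, BUILDING"),
        ("NARCOTICS", "SCHOOL, PUBLIC, GROUNDS"),
        ("THEFT", "STREET"),
    ],
)

# Distinct (type, location) pairs: BATTERY shows up twice.
pairs = demo.execute(
    """SELECT DISTINCT PRIMARY_TYPE, LOCATION_DESCRIPTION FROM CRIMES
       WHERE LOCATION_DESCRIPTION LIKE '%SCHOOL%'
       ORDER BY PRIMARY_TYPE, LOCATION_DESCRIPTION"""
).fetchall()

# Distinct crime types only: each type appears once.
types_only = [row[0] for row in demo.execute(
    """SELECT DISTINCT PRIMARY_TYPE FROM CRIMES
       WHERE LOCATION_DESCRIPTION LIKE '%SCHOOL%'
       ORDER BY PRIMARY_TYPE"""
)]

print(pairs)
print(types_only)
demo.close()
```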

# QUERIES FOR MULTIPLE TABLES
***

## What is the hardship index for the community area containing the school at street address '3630 S Wells St'?

In [None]:
%%sql SELECT CD.COMMUNITY_AREA_NUMBER, CD.HARDSHIP_INDEX, CPS.Street_Address FROM CENSUS_DATA CD, CHICAGO_PUBLIC_SCHOOLS CPS
        WHERE CD.COMMUNITY_AREA_NUMBER = CPS.COMMUNITY_AREA_NUMBER AND CPS.Street_Address = '3630 S Wells St';
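
The comma-separated `FROM` list with a matching condition in `WHERE` is an implicit inner join. A minimal sketch, on invented rows and made-up addresses, showing that it returns the same result as the explicit `JOIN ... ON` spelling, with the address passed as a bound parameter:

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE CENSUS_DATA (COMMUNITY_AREA_NUMBER INTEGER, HARDSHIP_INDEX REAL)")
demo.execute("CREATE TABLE CHICAGO_PUBLIC_SCHOOLS (COMMUNITY_AREA_NUMBER INTEGER, Street_Address TEXT)")
demo.executemany("INSERT INTO CENSUS_DATA VALUES (?, ?)", [(1, 10.0), (2, 55.0)])
demo.executemany(
    "INSERT INTO CHICAGO_PUBLIC_SCHOOLS VALUES (?, ?)",
    [(2, "100 Example St"), (1, "200 Sample Ave")],
)

address = ("100 Example St",)  # hypothetical address, bound via the ? placeholder

# Implicit join: tables listed with a comma, join condition in WHERE.
implicit = demo.execute(
    """SELECT CD.HARDSHIP_INDEX
       FROM CENSUS_DATA CD, CHICAGO_PUBLIC_SCHOOLS CPS
       WHERE CD.COMMUNITY_AREA_NUMBER = CPS.COMMUNITY_AREA_NUMBER
         AND CPS.Street_Address = ?""",
    address,
).fetchall()

# Explicit join: the join condition moves to ON, the filter stays in WHERE.
explicit = demo.execute(
    """SELECT CD.HARDSHIP_INDEX
       FROM CENSUS_DATA CD
       JOIN CHICAGO_PUBLIC_SCHOOLS CPS
         ON CD.COMMUNITY_AREA_NUMBER = CPS.COMMUNITY_AREA_NUMBER
       WHERE CPS.Street_Address = ?""",
    address,
).fetchall()

print(implicit, explicit)
demo.close()
```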

## What is the hardship index for the community area which has the school with the highest college enrollment?

In [None]:
%%sql SELECT COMMUNITY_AREA_NUMBER, COMMUNITY_AREA_NAME, HARDSHIP_INDEX FROM CENSUS_DATA
        WHERE COMMUNITY_AREA_NUMBER IN(SELECT COMMUNITY_AREA_NUMBER FROM CHICAGO_PUBLIC_SCHOOLS ORDER BY COLLEGE_ENROLLMENT DESC LIMIT 1);

## Which community area is most crime-prone?

In [None]:
%%sql
SELECT CD.COMMUNITY_AREA_NAME, CD.COMMUNITY_AREA_NUMBER, COUNT(*) AS NUMBER_OF_CRIMES
FROM CHICAGO_CRIME_DATA CCD
JOIN CENSUS_DATA CD ON CD.COMMUNITY_AREA_NUMBER = CCD.COMMUNITY_AREA_NUMBER
GROUP BY CD.COMMUNITY_AREA_NUMBER, CD.COMMUNITY_AREA_NAME
ORDER BY NUMBER_OF_CRIMES DESC LIMIT 1;
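
One subtlety with this kind of question is that crime records without a community area number form their own `NULL` group, which can outnumber every real area. The sketch below uses invented rows to show that grouping the crime table alone can pick the `NULL` group, while joining to the census table first counts only crimes that map to a known area.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE CENSUS_DATA (COMMUNITY_AREA_NUMBER INTEGER, COMMUNITY_AREA_NAME TEXT)")
demo.execute("CREATE TABLE CHICAGO_CRIME_DATA (ID INTEGER, COMMUNITY_AREA_NUMBER INTEGER)")
demo.executemany("INSERT INTO CENSUS_DATA VALUES (?, ?)", [(1, "Area A"), (2, "Area B")])
demo.executemany(
    "INSERT INTO CHICAGO_CRIME_DATA VALUES (?, ?)",
    [(1, 1), (2, 2), (3, 2), (4, None), (5, None), (6, None)],
)

# Grouping the crime table alone: the NULL group has the most rows.
crime_only = demo.execute(
    """SELECT COMMUNITY_AREA_NUMBER, COUNT(*) FROM CHICAGO_CRIME_DATA
       GROUP BY COMMUNITY_AREA_NUMBER ORDER BY COUNT(*) DESC LIMIT 1"""
).fetchone()

# Joining first: NULL never matches in the ON condition, so those rows drop out.
joined = demo.execute(
    """SELECT CD.COMMUNITY_AREA_NAME, COUNT(*) AS NUMBER_OF_CRIMES
       FROM CHICAGO_CRIME_DATA CCD
       JOIN CENSUS_DATA CD ON CD.COMMUNITY_AREA_NUMBER = CCD.COMMUNITY_AREA_NUMBER
       GROUP BY CD.COMMUNITY_AREA_NUMBER, CD.COMMUNITY_AREA_NAME
       ORDER BY NUMBER_OF_CRIMES DESC LIMIT 1"""
).fetchone()

print(crime_only)  # (None, 3)
print(joined)      # ('Area B', 2)
demo.close()
```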

# <center>________________________________________________________________</center>