## Analyzing Chicago city data with SQL and Python

You work as a Data Engineer for the City of Chicago. You have been tasked with using three datasets available on the City of Chicago's Data Portal to answer some questions. The datasets are listed below:
1. <a href="https://data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDB0201ENSkillsNetwork22-2022-01-01">Socioeconomic Indicators in Chicago</a>
1. <a href="https://data.cityofchicago.org/Education/Chicago-Public-Schools-Progress-Report-Cards-2011-/9xs2-f89t?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDB0201ENSkillsNetwork22-2022-01-01">Chicago Public Schools</a>
1. <a href="https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDB0201ENSkillsNetwork22-2022-01-01">Chicago Crime Data</a>

### 1. Socioeconomic Indicators in Chicago
This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” for each Chicago community area, for the years 2008 – 2012.

For this assignment you will use a snapshot of this dataset which can be downloaded from: <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork/labs/FinalModule_SKO/data/Census_Data_-_Selected_socioeconomic_indicators_in_Chicago__2008___2012-v2.csv" target="_blank"> Census Data </a>

A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at:
https://data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2



### 2. Chicago Public Schools

This dataset shows all school level performance data used to create CPS School Report Cards for the 2011-2012 school year. This dataset is provided by the city of Chicago's Data Portal.

For this assignment you will use a snapshot of this dataset which can be downloaded from: <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork/labs/FinalModule_SKO/data/Chicago_Public_Schools_-_Progress_Report_Cards__2011-2012-v3.csv" target="_blank"> Chicago Public School </a>

A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at:
https://data.cityofchicago.org/Education/Chicago-Public-Schools-Progress-Report-Cards-2011-/9xs2-f89t




### 3. Chicago Crime Data 

This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. 

This dataset is quite large: over 1.5 GB in size with over 6.5 million rows. For the purposes of this assignment we will use a much smaller sample of this dataset which can be downloaded from: <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork/labs/FinalModule_SKO/data/Chicago_Crime_Data-v2.csv" target="_blank"> Chicago Crime Data </a>

A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at:
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2

### Download the datasets
In many cases the dataset to be analyzed is available as a .CSV (comma-separated values) file, often hosted online. Click on the links below to download and save the datasets (.CSV files):
1. __CENSUS_DATA:__ <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork/labs/FinalModule_Coursera/data/Census_Data_-_Selected_socioeconomic_indicators_in_Chicago__2008___2012-v2.csv" target="_blank">Census Dataset</a>

1. __CHICAGO_PUBLIC_SCHOOLS:__ <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork/labs/FinalModule_Coursera/data/Chicago_Public_Schools_-_Progress_Report_Cards__2011-2012-v3.csv" target="_blank"> Chicago Public School</a>

1. __CHICAGO_CRIME_DATA:__ <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork/labs/FinalModule_Coursera/data/Chicago_Crime_Data-v2.csv" target="_blank"> Chicago Crime Data </a>

__NOTE:__ Download the datasets using the links above rather than directly from the Chicago Data Portal. The versions linked here are subsets of the original datasets, and some of their column names have been modified to be more database-friendly, which makes it easier to complete this assignment.



### Connect to the PostgreSQL database
First, import the required packages and establish a connection to the database.

In [1]:
# Install the PostgreSQL driver for Python (local, only needed once)
#!pip install psycopg2

# Import packages
import csv
import psycopg2
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=UserWarning) 
print('Project libraries have been successfully imported!')

Project libraries have been successfully imported!


In [2]:
# Connect to the database
conn = psycopg2.connect(
    host = 'localhost',
    database = 'analysis', 
    user = 'postgres', 
    password = '@Mexx4u2nv',  
    port = '5432')

print('Connected to the database successfully')

Connected to the database successfully
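
The connection cell above hard-codes the database password in the notebook. One common alternative is to read the settings from environment variables. The sketch below is only an illustration under that assumption: the helper name `connection_params` is hypothetical, and the variable names (`PGHOST`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`, `PGPORT`) follow the libpq naming convention. The defaults mirror the values used in this notebook.

```python
import os

def connection_params(env=os.environ):
    """Build keyword arguments for psycopg2.connect() from environment variables.

    Hypothetical helper: the defaults mirror the values used in this notebook.
    """
    return {
        "host": env.get("PGHOST", "localhost"),
        "database": env.get("PGDATABASE", "analysis"),
        "user": env.get("PGUSER", "postgres"),
        "password": env["PGPASSWORD"],  # no default: raises KeyError if unset
        "port": env.get("PGPORT", "5432"),
    }

# Usage (assumes PGPASSWORD was set in the shell that started the notebook):
# conn = psycopg2.connect(**connection_params())
```

Leaving the password without a default means a missing variable fails immediately with a `KeyError` instead of attempting a connection with a wrong credential.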


In [3]:
# function to read from database
def read(conn, read_query):
    print('Read')
    cursor = conn.cursor()
    cursor.execute(read_query)
    for row in cursor:
        print(f'row = {row}')
    print()
    
# function to create a table in the PostgreSQL database
def create(conn, create_query):
    cursor = conn.cursor() # create cursor object
    cursor.execute(create_query) # execute query
    conn.commit() # commit query to database
    print('Table have been created successfull!!!')
    #read(conn)
    
# function to insert records into the PostgreSQL database
def insert(conn, insert_query):
    cursor = conn.cursor()
    cursor.execute(insert_query)
    conn.commit()
    print('Records have been successfully inserted!!!')
    #read(conn)
    
# function to update table
def update(conn, update_query):
    print('Update')
    cursor = conn.cursor()
    cursor.execute(update_query)
    conn.commit()
    #read(conn)
    
# function to delete records from the PostgreSQL database
def delete(conn, delete_query):
    print('Delete')
    cursor = conn.cursor()
    cursor.execute(delete_query)
    conn.commit()
    #read(conn)

# close the connection to the server
# (each helper above creates its own cursor, so only the connection is closed here)
def close(conn):
    conn.close()
    
# function to create pandas dataframe
def create_pandas_df(sql_query, database=conn):
    table = pd.read_sql(sql_query, database)
    return table
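
The helpers above open a cursor on every call and never close it. A minimal sketch of a read helper that always releases its cursor and accepts query parameters is shown below. The name `fetch_all` is hypothetical, and the demo uses an in-memory `sqlite3` database purely as a stand-in so the example runs without a PostgreSQL server. Note that `sqlite3` uses `?` placeholders, whereas psycopg2 uses `%s`.

```python
import sqlite3
from contextlib import closing

def fetch_all(conn, query, params=()):
    """Run a query and return all rows; the cursor is closed even on error."""
    with closing(conn.cursor()) as cursor:
        cursor.execute(query, params)
        return cursor.fetchall()

# Stand-in demo: an in-memory SQLite table with two made-up rows
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo (id INTEGER, primary_type TEXT)")
demo_conn.executemany(
    "INSERT INTO demo VALUES (?, ?)", [(1, "THEFT"), (2, "ROBBERY")]
)

# The value is passed separately from the SQL text instead of being
# formatted into the query string
rows = fetch_all(demo_conn, "SELECT id FROM demo WHERE primary_type = ?", ("THEFT",))
print(rows)
```

Passing values as parameters rather than building the SQL string by hand lets the driver handle quoting.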

### Create tables in the PostgreSQL database

In [4]:
# Create table Chicago Crime Data
create_query = '''
DROP TABLE IF EXISTS chicago_crime;
CREATE TABLE chicago_crime (
    ID INTEGER
    , CASE_NUMBER VARCHAR(10)
    , DATE TIMESTAMP
    , BLOCK VARCHAR(255)
    , IUCR VARCHAR(10)
    , PRIMARY_TYPE VARCHAR(255)
    , DESCRIPTION VARCHAR(255)
    , LOCATION_DESCRIPTION VARCHAR(255)
    , ARREST BOOLEAN
    , DOMESTIC BOOLEAN
    , BEAT INTEGER
    , DISTRICT INTEGER
    , WARD INTEGER
    , COMMUNITY_AREA_NUMBER INTEGER
    , FBICODE VARCHAR(5)
    , X_COORDINATE INTEGER
    , Y_COORDINATE INTEGER
    , YEAR INTEGER
    , UPDATEDON TIMESTAMP
    , LATITUDE DOUBLE PRECISION
    , LONGITUDE DOUBLE PRECISION
    , LOCATION VARCHAR(255)
);

'''
create(conn, create_query)


# PostgreSQL import CSV
cursor = conn.cursor()
with open('Chicago_Crime_Data-v2.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader) # Skip the header row.
    for row in reader:
        # Replace empty strings and 'NDA' placeholders with None (stored as NULL)
        row = [None if val == '' or val == 'NDA' else val for val in row]
        cursor.execute(
            """INSERT INTO chicago_crime 
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)""",
            row
        )
conn.commit()
print('Chicago Crime data inserted into database successfully!!!')

Table have been created successfull!!!
Chicago Crime data inserted into database successfully!!!
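
The import loop above issues one `INSERT` per CSV row after mapping empty strings and `'NDA'` to `None`. The same idea can be sketched with a generator that cleans the rows and a single `executemany` call. This is only an illustration: the sample CSV text, the `crime_demo` table, and the `clean_rows` helper are all hypothetical, and `sqlite3` stands in for PostgreSQL so the block runs on its own (psycopg2 cursors also provide `executemany`, with `%s` placeholders).

```python
import csv
import io
import sqlite3

MISSING = {"", "NDA"}  # markers treated as missing values, as in the loop above

def clean_rows(lines):
    """Yield CSV rows (header skipped) with missing markers replaced by None."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    for row in reader:
        yield [None if val in MISSING else val for val in row]

# Made-up sample standing in for the downloaded CSV file
sample_csv = io.StringIO("ID,PRIMARY_TYPE,WARD\n1,THEFT,14\n2,ROBBERY,\n3,NDA,27\n")

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE crime_demo (id INTEGER, primary_type TEXT, ward INTEGER)"
)
demo_conn.executemany("INSERT INTO crime_demo VALUES (?, ?, ?)", clean_rows(sample_csv))
demo_conn.commit()

# Empty and 'NDA' fields arrive in the table as NULL
missing_ward = demo_conn.execute(
    "SELECT COUNT(*) FROM crime_demo WHERE ward IS NULL"
).fetchone()[0]
missing_type = demo_conn.execute(
    "SELECT COUNT(*) FROM crime_demo WHERE primary_type IS NULL"
).fetchone()[0]
```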


In [5]:
# Create table Census data
create_query = '''
DROP TABLE IF EXISTS chicago_census;
CREATE TABLE chicago_census (
    COMMUNITY_AREA_NUMBER INTEGER
    , COMMUNITY_AREA_NAME VARCHAR(255)
    , "PERCENT OF HOUSING CROWDED" DECIMAL(5, 2)
    , "PERCENT HOUSEHOLDS BELOW POVERTY" DECIMAL(5, 2)
    , "PERCENT AGED 16+ UNEMPLOYED" DECIMAL(5, 2)
    , "PERCENT AGED 25+ WITHOUT HIGH SCHOOL DIPLOMA" DECIMAL(5, 2)
    , "PERCENT AGED UNDER 18 OR OVER 64" DECIMAL(5, 2)
    , PER_CAPITA_INCOME INTEGER
    , HARDSHIP_INDEX INTEGER
);
'''
create(conn, create_query)

# Insert CSV file into database
cursor = conn.cursor()
with open('Census_Data_-_Selected_socioeconomic_indicators_in_Chicago__2008___2012-v2.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader) # Skip the header row.
    for row in reader:
        # Replace empty strings and 'NDA' placeholders with None (stored as NULL)
        row = [None if val == '' or val == 'NDA' else val for val in row]
        cursor.execute(
            """INSERT INTO chicago_census 
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)""",
            row
        )
conn.commit()
print('Chicago Census data inserted into database successfully!!!')

Table have been created successfull!!!
Chicago Census data inserted into database successfully!!!


In [6]:
# Read data 
read_query0 = '''
    SELECT *
    FROM chicago_crime
    LIMIT 5
    '''
chicago_crime_df = create_pandas_df(read_query0, conn)
chicago_crime_df.head()

|   | id | case_number | date | block | iucr | primary_type | description | location_description | arrest | domestic | ... | ward | community_area_number | fbicode | x_coordinate | y_coordinate | year | updatedon | latitude | longitude | location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3512276 | HK587712 | 2004-08-28 17:50:56 | 047XX S KEDZIE AVE | 890 | THEFT | FROM BUILDING | SMALL RETAIL STORE | False | False | ... | 14 | 58 | 6 | 1155838 | 1873050 | 2004 | 2018-02-10 15:50:01 | 41.80744 | -87.703956 | (41.8074405, -87.703955849) |
| 1 | 3406613 | HK456306 | 2004-06-26 12:40:00 | 009XX N CENTRAL PARK AVE | 820 | THEFT | $500 AND UNDER | OTHER | False | False | ... | 27 | 23 | 6 | 1152206 | 1906127 | 2004 | 2018-02-28 15:56:25 | 41.89828 | -87.716406 | (41.898279962, -87.716405505) |
| 2 | 8002131 | HT233595 | 2011-04-04 05:45:00 | 043XX S WABASH AVE | 820 | THEFT | $500 AND UNDER | NURSING HOME/RETIREMENT HOME | False | False | ... | 3 | 38 | 6 | 1177436 | 1876313 | 2011 | 2018-02-10 15:50:01 | 41.815933 | -87.624642 | (41.815933131, -87.624642127) |
| 3 | 7903289 | HT133522 | 2010-12-30 16:30:00 | 083XX S KINGSTON AVE | 840 | THEFT | FINANCIAL ID THEFT: OVER $300 | RESIDENCE | False | False | ... | 7 | 46 | 6 | 1194622 | 1850125 | 2010 | 2018-02-10 15:50:01 | 41.743665 | -87.562463 | (41.743665322, -87.562462756) |
| 4 | 10402076 | HZ138551 | 2016-02-02 19:30:00 | 033XX W 66TH ST | 820 | THEFT | $500 AND UNDER | ALLEY | False | False | ... | 15 | 66 | 6 | 1155240 | 1860661 | 2016 | 2018-02-10 15:50:01 | 41.773455 | -87.70648 | (41.773455295, -87.706480471) |


In [7]:
# Read data 
read_query1 = '''
    SELECT *
    FROM chicago_census
    LIMIT 5
    '''
chicago_census_df = create_pandas_df(read_query1, conn)
chicago_census_df.head()

|   | community_area_number | community_area_name | PERCENT OF HOUSING CROWDED | PERCENT HOUSEHOLDS BELOW POVERTY | PERCENT AGED 16+ UNEMPLOYED | PERCENT AGED 25+ WITHOUT HIGH SCHOOL DIPLOMA | PERCENT AGED UNDER 18 OR OVER 64 | per_capita_income | hardship_index |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Rogers Park | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39 |
| 1 | 2 | West Ridge | 7.8 | 17.2 | 8.8 | 20.8 | 38.5 | 23040 | 46 |
| 2 | 3 | Uptown | 3.8 | 24.0 | 8.9 | 11.8 | 22.2 | 35787 | 20 |
| 3 | 4 | Lincoln Square | 3.4 | 10.9 | 8.2 | 13.4 | 25.5 | 37524 | 17 |
| 4 | 5 | North Center | 0.3 | 7.5 | 5.2 | 4.5 | 26.2 | 57123 | 6 |


#### Find the total number of crimes recorded in the CRIME table

In [8]:
query_0 = '''
    SELECT COUNT(DISTINCT id) AS total_crime
    FROM chicago_crime
    '''
create_pandas_df(query_0, conn)

|   | total_crime |
|---|---|
| 0 | 533 |


#### Retrieve first 10 rows from the CRIME table

In [9]:
query_1 = '''
    SELECT * 
    FROM chicago_crime
    LIMIT 10
    '''
create_pandas_df(query_1, conn)

|   | id | case_number | date | block | iucr | primary_type | description | location_description | arrest | domestic | ... | ward | community_area_number | fbicode | x_coordinate | y_coordinate | year | updatedon | latitude | longitude | location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3512276 | HK587712 | 2004-08-28 17:50:56 | 047XX S KEDZIE AVE | 890 | THEFT | FROM BUILDING | SMALL RETAIL STORE | False | False | ... | 14 | 58 | 6 | 1155838 | 1873050 | 2004 | 2018-02-10 15:50:01 | 41.80744 | -87.703956 | (41.8074405, -87.703955849) |
| 1 | 3406613 | HK456306 | 2004-06-26 12:40:00 | 009XX N CENTRAL PARK AVE | 820 | THEFT | $500 AND UNDER | OTHER | False | False | ... | 27 | 23 | 6 | 1152206 | 1906127 | 2004 | 2018-02-28 15:56:25 | 41.89828 | -87.716406 | (41.898279962, -87.716405505) |
| 2 | 8002131 | HT233595 | 2011-04-04 05:45:00 | 043XX S WABASH AVE | 820 | THEFT | $500 AND UNDER | NURSING HOME/RETIREMENT HOME | False | False | ... | 3 | 38 | 6 | 1177436 | 1876313 | 2011 | 2018-02-10 15:50:01 | 41.815933 | -87.624642 | (41.815933131, -87.624642127) |
| 3 | 7903289 | HT133522 | 2010-12-30 16:30:00 | 083XX S KINGSTON AVE | 840 | THEFT | FINANCIAL ID THEFT: OVER $300 | RESIDENCE | False | False | ... | 7 | 46 | 6 | 1194622 | 1850125 | 2010 | 2018-02-10 15:50:01 | 41.743665 | -87.562463 | (41.743665322, -87.562462756) |
| 4 | 10402076 | HZ138551 | 2016-02-02 19:30:00 | 033XX W 66TH ST | 820 | THEFT | $500 AND UNDER | ALLEY | False | False | ... | 15 | 66 | 6 | 1155240 | 1860661 | 2016 | 2018-02-10 15:50:01 | 41.773455 | -87.70648 | (41.773455295, -87.706480471) |
| 5 | 7732712 | HS540106 | 2010-09-29 07:59:00 | 006XX W CHICAGO AVE | 810 | THEFT | OVER $500 | PARKING LOT/GARAGE(NON.RESID.) | False | False | ... | 27 | 24 | 6 | 1171668 | 1905607 | 2010 | 2018-02-10 15:50:01 | 41.896447 | -87.644939 | (41.896446772, -87.644938678) |
| 6 | 10769475 | HZ534771 | 2016-11-30 01:15:00 | 050XX N KEDZIE AVE | 810 | THEFT | OVER $500 | STREET | False | False | ... | 33 | 14 | 6 | 1154133 | 1933314 | 2016 | 2018-02-10 15:50:01 | 41.972845 | -87.7086 | (41.972844913, -87.708600079) |
| 7 | 4494340 | HL793243 | 2005-12-16 16:45:00 | 005XX E PERSHING RD | 860 | THEFT | RETAIL THEFT | GROCERY FOOD STORE | True | False | ... | 3 | 38 | 6 | 1180448 | 1879234 | 2005 | 2018-02-28 15:56:25 | 41.82388 | -87.613504 | (41.823879885, -87.613503857) |
| 8 | 3778925 | HL149610 | 2005-01-28 17:00:00 | 100XX S WASHTENAW AVE | 810 | THEFT | OVER $500 | STREET | False | False | ... | 19 | 72 | 6 | 1160129 | 1838040 | 2005 | 2018-02-28 15:56:25 | 41.711281 | -87.689179 | (41.711280513, -87.689179097) |
| 9 | 3324217 | HK361551 | 2004-05-13 14:15:00 | 033XX W BELMONT AVE | 820 | THEFT | $500 AND UNDER | SMALL RETAIL STORE | False | False | ... | 35 | 21 | 6 | 1153590 | 1921084 | 2004 | 2018-02-28 15:56:25 | 41.939296 | -87.710923 | (41.939295821, -87.710923442) |


#### How many crimes involve an arrest?

In [10]:
query_2 = '''
    SELECT COUNT(arrest) AS crime_involving_arrest
    FROM chicago_crime
    WHERE arrest = True
    '''
create_pandas_df(query_2, conn)

|   | crime_involving_arrest |
|---|---|
| 0 | 163 |


#### Which unique types of crimes have been recorded at GAS STATION locations?

In [11]:
query_3 = '''
    SELECT DISTINCT primary_type AS unique_gas_station_crime
    FROM chicago_crime
    WHERE lower(location_description) LIKE 'gas%'
    '''
create_pandas_df(query_3, conn)

|   | unique_gas_station_crime |
|---|---|
| 0 | CRIMINAL TRESPASS |
| 1 | NARCOTICS |
| 2 | ROBBERY |
| 3 | THEFT |


#### In the CENSUS data table, list all Community Areas whose names start with the letter 'B'

In [12]:
query_4 = '''
    SELECT community_area_name AS community_areas_starting_with_b
    FROM chicago_census
    WHERE community_area_name LIKE 'B%'
    '''
create_pandas_df(query_4, conn)

|   | community_areas_starting_with_b |
|---|---|
| 0 | Belmont Cragin |
| 1 | Burnside |
| 2 | Brighton Park |
| 3 | Bridgeport |
| 4 | Beverly |


#### Which schools in Community Areas 10 to 15 are healthy school certified?

In [13]:
query_5 = '''
    SELECT cc.community_area_number
        , cc.community_area_name
        , cs."NAME_OF_SCHOOL" 
    FROM chicago_census cc
    INNER JOIN chicago_schools cs
        ON cc.community_area_number = cs."COMMUNITY_AREA_NUMBER"
    WHERE cc.community_area_number BETWEEN 10 AND 15 
        AND cs."HEALTHY_SCHOOL_CERTIFIED" = 'Yes'
    ORDER BY 1, 2, 3
    '''
create_pandas_df(query_5, conn)

|   | community_area_number | community_area_name | NAME_OF_SCHOOL |
|---|---|---|---|
| 0 | 10 | Norwood Park | Rufus M Hitch Elementary School |


#### What is the average school Safety Score?

In [14]:
query_6 = '''
SELECT 
    ROUND(AVG("SAFETY_SCORE"), 2) AS Average_Safety_Score
FROM chicago_schools 
    '''
create_pandas_df(query_6, conn)

|   | average_safety_score |
|---|---|
| 0 | 49.5 |


#### List the top 5 Community Areas by average College Enrollment

In [15]:
query_7 = '''
SELECT 
    "COMMUNITY_AREA_NAME"
    , ROUND(AVG("COLLEGE_ENROLLMENT"), 3) AS Average_College_Enrollment
FROM chicago_schools 
GROUP BY 1
ORDER BY -AVG("COLLEGE_ENROLLMENT")
LIMIT 5
    '''
create_pandas_df(query_7, conn)

|   | COMMUNITY_AREA_NAME | average_college_enrollment |
|---|---|---|
| 0 | ARCHER HEIGHTS | 2411.5 |
| 1 | MONTCLARE | 1317.0 |
| 2 | WEST ELSDON | 1233.333 |
| 3 | BRIGHTON PARK | 1205.875 |
| 4 | BELMONT CRAGIN | 1198.833 |
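
The query above gets a descending order by negating the aggregate (`ORDER BY -AVG(...)`). The more conventional spelling is `ORDER BY ... DESC`. The sketch below shows that form on made-up data, with `sqlite3` as a stand-in so it runs without the PostgreSQL database; the table and column names are hypothetical. One caveat when porting this to PostgreSQL: `DESC` sorts NULL values first by default, so groups with a NULL average may need `NULLS LAST`. The demo data contains no NULLs.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE schools_demo (area TEXT, college_enrollment INTEGER)"
)
# Made-up rows: area A averages 200, B averages 500, C averages 50
demo_conn.executemany(
    "INSERT INTO schools_demo VALUES (?, ?)",
    [("A", 100), ("A", 300), ("B", 500), ("C", 50)],
)

# Top 2 areas by average enrollment, highest first
top_areas = demo_conn.execute(
    """
    SELECT area, AVG(college_enrollment) AS avg_enrollment
    FROM schools_demo
    GROUP BY area
    ORDER BY avg_enrollment DESC
    LIMIT 2
    """
).fetchall()
print(top_areas)
```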


#### Use a sub-query to determine which Community Area has the lowest school Safety Score

In [16]:
query_8 = '''
SELECT "COMMUNITY_AREA_NUMBER"
    , "COMMUNITY_AREA_NAME"
FROM chicago_schools
WHERE "SAFETY_SCORE" = (
    SELECT MIN("SAFETY_SCORE") AS Min_Safety_Score
    FROM chicago_schools
    )
    '''
create_pandas_df(query_8, conn)

|   | COMMUNITY_AREA_NUMBER | COMMUNITY_AREA_NAME |
|---|---|---|
| 0 | 40 | WASHINGTON PARK |


#### Without using an explicit JOIN operator, find the Per Capita Income of the Community Area which has a school Safety Score of 1

In [17]:
query_9 = '''
SELECT per_capita_income
FROM chicago_census
WHERE community_area_number = (
    SELECT 
        "COMMUNITY_AREA_NUMBER"
    FROM chicago_schools
    WHERE "SAFETY_SCORE" = 1
    )
    '''
create_pandas_df(query_9, conn)

|   | per_capita_income |
|---|---|
| 0 | 13785 |
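
The sub-query above is compared with `=`, which works here because exactly one Community Area has a school with a Safety Score of 1. In PostgreSQL, a scalar sub-query that returns more than one row raises an error, so `IN` is the safer form when several areas could match. The sketch below illustrates this on made-up data, with `sqlite3` as a stand-in for PostgreSQL; the demo tables and values are hypothetical.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript(
    """
    CREATE TABLE census_demo (community_area_number INTEGER, per_capita_income INTEGER);
    CREATE TABLE schools_demo (community_area_number INTEGER, safety_score INTEGER);
    INSERT INTO census_demo VALUES (1, 20000), (2, 30000), (3, 40000);
    -- Two different areas (1 and 3) have a school with safety_score = 1
    INSERT INTO schools_demo VALUES (1, 1), (3, 1), (2, 50);
    """
)

# IN accepts any number of rows from the sub-query
incomes = demo_conn.execute(
    """
    SELECT per_capita_income
    FROM census_demo
    WHERE community_area_number IN (
        SELECT community_area_number
        FROM schools_demo
        WHERE safety_score = 1
    )
    ORDER BY per_capita_income
    """
).fetchall()
print(incomes)
```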
