# <font color=blue>COVID analysis for neighbourhoods in New York City</font>

## Problem Statement:
Is there a correlation between population demographic factors and the COVID infection rate across neighbourhoods in New York City? Can data correlations and visualisations support it?

## Background: 
COVID has spread across the globe, and infection and mortality rates differ from place to place. In this exercise we analyse COVID infection rates in New York City and look for correlations with population demographics such as race and income group.

## Data Description
### Sources and types of data used in this analysis
1. CDC COVID data tabulated by neighbourhood.
2. NYC municipal data in GeoJSON format describing the neighbourhood boundaries.
3. NYC census data for various demographics such as race, sex and income group.

Some minor cleanup is needed because the data comes from several different sources. The key that links them is the modified ZCTA (MODZCTA), a modified ZIP Code Tabulation Area code.

https://github.com/nychealth/coronavirus-data

## Methodology
### Import the relevant libraries for analysis

In [None]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
%matplotlib inline 
import matplotlib as mpl
import matplotlib.pyplot as plt
import folium

#### Download the neighbourhood-wise COVID case rate data for NY into a dataframe

In [None]:
# Read the CSV from the raw file URL; the GitHub "blob" URL serves an HTML page, not the CSV itself
url = 'https://raw.githubusercontent.com/nychealth/coronavirus-data/master/data-by-modzcta.csv'
dfByLocation = pd.read_csv(url)
dfByLocation.head()

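#### Keep only the required columns

In [184]:
# .copy() gives an independent frame, so later column assignments do not act on a view of dfByLocation
dfByLocationTrim = dfByLocation[['MODIFIED_ZCTA','NEIGHBORHOOD_NAME','BOROUGH_GROUP','COVID_CASE_RATE']].copy()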

#### The MODIFIED_ZCTA column is the key that identifies each neighbourhood in the GeoJSON file.

In [None]:
# Store the key as a string so it can be matched against the GeoJSON property
dfByLocationTrim['MODIFIED_ZCTA'] = dfByLocationTrim['MODIFIED_ZCTA'].astype(str)
dfByLocationTrim.head()

In [None]:
filename= "Demographic_Statistics_By_Zip_Code.csv"

In [178]:
dfDemographic = pd.read_csv(filename)
dfDemographic.head()

Unnamed: 0,JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE,PERCENT FEMALE,COUNT MALE,PERCENT MALE,COUNT GENDER UNKNOWN,PERCENT GENDER UNKNOWN,COUNT GENDER TOTAL,PERCENT GENDER TOTAL,...,COUNT CITIZEN STATUS TOTAL,PERCENT CITIZEN STATUS TOTAL,COUNT RECEIVES PUBLIC ASSISTANCE,PERCENT RECEIVES PUBLIC ASSISTANCE,COUNT NRECEIVES PUBLIC ASSISTANCE,PERCENT NRECEIVES PUBLIC ASSISTANCE,COUNT PUBLIC ASSISTANCE UNKNOWN,PERCENT PUBLIC ASSISTANCE UNKNOWN,COUNT PUBLIC ASSISTANCE TOTAL,PERCENT PUBLIC ASSISTANCE TOTAL
0,10001,44,22,0.5,22,0.5,0,0,44,100,...,44,100,20,0.45,24,0.55,0,0,44,100
1,10002,35,19,0.54,16,0.46,0,0,35,100,...,35,100,2,0.06,33,0.94,0,0,35,100
2,10003,1,1,1.0,0,0.0,0,0,1,100,...,1,100,0,0.0,1,1.0,0,0,1,100
3,10004,0,0,0.0,0,0.0,0,0,0,0,...,0,0,0,0.0,0,0.0,0,0,0,0
4,10005,2,2,1.0,0,0.0,0,0,2,100,...,2,100,0,0.0,2,1.0,0,0,2,100


In [179]:
dfDemographic.columns

Index(['JURISDICTION NAME', 'COUNT PARTICIPANTS', 'COUNT FEMALE',
       'PERCENT FEMALE', 'COUNT MALE', 'PERCENT MALE', 'COUNT GENDER UNKNOWN',
       'PERCENT GENDER UNKNOWN', 'COUNT GENDER TOTAL', 'PERCENT GENDER TOTAL',
       'COUNT PACIFIC ISLANDER', 'PERCENT PACIFIC ISLANDER',
       'COUNT HISPANIC LATINO', 'PERCENT HISPANIC LATINO',
       'COUNT AMERICAN INDIAN', 'PERCENT AMERICAN INDIAN',
       'COUNT ASIAN NON HISPANIC', 'PERCENT ASIAN NON HISPANIC',
       'COUNT WHITE NON HISPANIC', 'PERCENT WHITE NON HISPANIC',
       'COUNT BLACK NON HISPANIC', 'PERCENT BLACK NON HISPANIC',
       'COUNT OTHER ETHNICITY', 'PERCENT OTHER ETHNICITY',
       'COUNT ETHNICITY UNKNOWN', 'PERCENT ETHNICITY UNKNOWN',
       'COUNT ETHNICITY TOTAL', 'PERCENT ETHNICITY TOTAL',
       'COUNT PERMANENT RESIDENT ALIEN', 'PERCENT PERMANENT RESIDENT ALIEN',
       'COUNT US CITIZEN', 'PERCENT US CITIZEN', 'COUNT OTHER CITIZEN STATUS',
       'PERCENT OTHER CITIZEN STATUS', 'COUNT CITIZEN STATUS UNKN

In [180]:
# Create the frame first (it was previously used before being defined),
# then store the combined share of the non-white groups per ZIP code
dfDemographicRace = pd.DataFrame()
dfDemographicRace['Coloured'] = (dfDemographic['PERCENT PACIFIC ISLANDER']
                                 + dfDemographic['PERCENT HISPANIC LATINO']
                                 + dfDemographic['PERCENT AMERICAN INDIAN']
                                 + dfDemographic['PERCENT ASIAN NON HISPANIC']
                                 + dfDemographic['PERCENT BLACK NON HISPANIC'])

In [181]:
# JURISDICTION NAME holds the ZIP code, which is used as the join key
dfDemographicRace['MODIFIED_ZCTA'] = dfDemographic['JURISDICTION NAME']

In [119]:
# Store the key as a string, matching dfByLocationTrim and the GeoJSON file
dfDemographicRace['MODIFIED_ZCTA'] = dfDemographicRace['MODIFIED_ZCTA'].astype("str")

In [120]:
dfDemographicRace.head()

Unnamed: 0,Coloured,MODIFIED_ZCTA
0,0.91,10001
1,0.83,10002
2,1.0,10003
3,0.0,10004
4,1.0,10005
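
With both frames keyed on `MODIFIED_ZCTA`, the correlation asked about in the problem statement can also be estimated numerically. The following is a minimal sketch, not part of the original analysis: the two small frames and their values are invented stand-ins for `dfByLocationTrim` and `dfDemographicRace`, and Pearson correlation is an assumed choice of measure.

```python
import pandas as pd

# Toy stand-ins for dfByLocationTrim and dfDemographicRace.
# The values are invented for illustration, NOT taken from the real data.
covid = pd.DataFrame({
    "MODIFIED_ZCTA": ["10001", "10002", "10003", "10004"],
    "COVID_CASE_RATE": [1500.0, 2600.0, 1200.0, 3100.0],
})
race = pd.DataFrame({
    "MODIFIED_ZCTA": ["10001", "10002", "10003", "10005"],
    "Coloured": [0.40, 0.75, 0.30, 0.90],
})

# The inner join keeps only the ZIP codes present in both frames
merged = covid.merge(race, on="MODIFIED_ZCTA", how="inner")

# Pearson correlation between the case rate and the demographic share
pearson_r = merged["COVID_CASE_RATE"].corr(merged["Coloured"])
print(len(merged), round(pearson_r, 3))
```

On the real frames the same two calls would apply unchanged; checking how many rows survive the join is worthwhile, because ZIP codes missing from either source silently drop out.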


## Showcasing a few properties of covid spread via visualisation

#### Download the NY GeoJSON data file

In [121]:
!wget --quiet https://raw.githubusercontent.com/nychealth/coronavirus-data/master/Geography-resources/MODZCTA_2010_WGS1984.geo.json -O nygeo.json
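
Before mapping, it can help to confirm that the keys in the dataframe match the keys in the GeoJSON file, since unmatched areas cannot be coloured. The sketch below uses a tiny hand-written GeoJSON string and a made-up key set in place of `nygeo.json` and `dfByLocationTrim`. It assumes the `MODZCTA` property is stored as a string, which the string conversion of `MODIFIED_ZCTA` suggests but which should be verified against the real file.

```python
import json

# Hypothetical miniature GeoJSON using the same property name (MODZCTA).
# In the notebook this would be loaded from the downloaded nygeo.json instead.
geo_text = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"MODZCTA": "10001"}, "geometry": null},
  {"type": "Feature", "properties": {"MODZCTA": "10002"}, "geometry": null}
]}
"""
geo = json.loads(geo_text)

# Keys found in the GeoJSON features
geo_keys = {feature["properties"]["MODZCTA"] for feature in geo["features"]}

# Made-up stand-in for set(dfByLocationTrim["MODIFIED_ZCTA"])
data_keys = {"10001", "10003"}

# Keys that would fail to join in either direction
missing_in_geo = sorted(data_keys - geo_keys)
missing_in_data = sorted(geo_keys - data_keys)
print(missing_in_geo, missing_in_data)
```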

### Creating a Folium Choropleth layer for the COVID infection rate (scale - Yellow ---> Red)

In [173]:
# Build six bin edges spanning the observed case rates
threshold_scale = np.linspace(dfByLocationTrim['COVID_CASE_RATE'].min(),
                              dfByLocationTrim['COVID_CASE_RATE'].max(),
                              6, dtype=int)
threshold_scale = threshold_scale.tolist()
threshold_scale[-1] = threshold_scale[-1] + 1  # make sure the last bin includes the maximum value

# Create the map for the COVID infection rate
nygeo = r'nygeo.json'  # GeoJSON file
ny_map = folium.Map(location=[40.730610, -73.935242], zoom_start=10)
# Generate the choropleth layer from the COVID case rates.
# folium.Choropleth replaces the older Map.choropleth method; its `bins`
# argument takes the list previously passed as `threshold_scale`.
covid_rate = folium.Choropleth(
    geo_data=nygeo,
    data=dfByLocationTrim,
    columns=['MODIFIED_ZCTA', 'COVID_CASE_RATE'],
    key_on='feature.properties.MODZCTA',
    bins=threshold_scale,
    fill_color='YlOrRd',  # yellow-to-red palette, matching the scale named in the heading
    name='Covid Rate',
    fill_opacity=0.4,
    line_opacity=0.2,
    legend_name='Covid cases per 100,000',
).add_to(ny_map)
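
The threshold calculation above is worth a closer look: `dtype=int` truncates each bin edge, so the last edge can end up below the true maximum, which is why 1 is added to it. A small sketch with invented case rates (not real data) shows the effect:

```python
import numpy as np

# Invented case rates, for illustration only
rates = np.array([812.4, 1490.0, 2333.7, 3999.2])

# Six evenly spaced edges from min to max, converted to integers
bins = np.linspace(rates.min(), rates.max(), 6, dtype=int).tolist()
last_edge_before_fix = bins[-1]

# Push the last edge above the maximum so every value falls inside a bin
bins[-1] = bins[-1] + 1
print(last_edge_before_fix, bins)
```

Without the adjustment the highest case rate would sit just outside the final bin.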

### Creating a Folium Choropleth layer for neighbourhoods with population of colour. (scale - Light Blue ---> Deep Purple \m/ )

#### The two layers are blended together so that the map can visually suggest a correlation between demographics and the COVID infection rate.
#### Deep red over blue appears magenta, which marks areas with both a high COVID infection rate and a high percentage of population of colour.
#### Use the layer selector to switch between the layers on the choropleth map.

In [174]:
# Generate the choropleth layer for the share of population of colour
folium.Choropleth(
    geo_data=nygeo,
    data=dfDemographicRace,
    columns=['MODIFIED_ZCTA', 'Coloured'],
    key_on='feature.properties.MODZCTA',
    fill_color='BuPu',  # light-blue-to-purple palette, matching the scale named in the heading
    name='Coloured Population',
    fill_opacity=0.4,
    line_opacity=0.1,
    legend_name='Share of population of colour',
).add_to(ny_map)
folium.LayerControl().add_to(ny_map)
# Display the map
ny_map