<a href="https://colab.research.google.com/github/Daniel-Benson-Poe/DS-Unit-2-Applied-Modeling/blob/master/db_LS_DS_232_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Wrangle ML datasets

- [ ] Continue to clean and explore your data. 
- [ ] For the evaluation metric you chose, what score would you get just by guessing?
- [ ] Can you make a fast, first model that beats guessing?
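
The second and third checklist items can be sketched on toy numbers. This is only an illustration, assuming a regression target scored with mean absolute error (MAE); the metric, the toy values, and the names `x`, `y`, `baseline_mae`, and `model_mae` are assumptions, not part of the assignment.

```python
import numpy as np

# Toy feature and target (hypothetical values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([12.0, 19.0, 33.0, 41.0, 48.0, 62.0])

# "Guessing": predict the mean of the target for every row
baseline_pred = np.full_like(y, y.mean())
baseline_mae = np.abs(y - baseline_pred).mean()

# Fast first model: a straight line fit by least squares
slope, intercept = np.polyfit(x, y, deg=1)
model_mae = np.abs(y - (slope * x + intercept)).mean()

print(f"Baseline MAE: {baseline_mae:.2f}")
print(f"Linear fit MAE: {model_mae:.2f}")
```

Both scores here are computed on the same rows the line was fit to; on a real dataset the comparison would normally be made on a held-out validation split.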

**We recommend that you use your portfolio project dataset for all assignments this sprint.**

**But if you aren't ready yet, or you want more practice, then use the New York City property sales dataset for today's assignment.** Follow the instructions below to keep only a subset for the Tribeca neighborhood and to remove outliers or dirty data. [Here's a video walkthrough](https://youtu.be/pPWFw8UtBVg?t=584) you can refer to if you get stuck or want hints!

- Data Source: [NYC OpenData: NYC Citywide Rolling Calendar Sales](https://data.cityofnewyork.us/dataset/NYC-Citywide-Rolling-Calendar-Sales/usep-8jbt)
- Glossary: [NYC Department of Finance: Rolling Sales Data](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page)

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/Daniel-Benson-Poe/practice_datasets/master/'
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Read in data
import pandas as pd
suicide_df = pd.read_csv(DATA_PATH+'suicide_rates.csv')

In [15]:
suicide_df.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


In [16]:
# Rename columns: use underscores instead of spaces and hyphens, and shorten the GDP names
# Note: the raw ' gdp_for_year ($) ' header has leading and trailing spaces
cols_to_name = ['suicides_no', 'suicides/100k pop', 'country-year', 'HDI for year', ' gdp_for_year ($) ', 'gdp_per_capita ($)']
new_col_names = ['num_suicides', 'suicides/100k_pop','country_year', 'HDI_for_year', 'annual_gdp', 'gdp_per_capita']
# Pair each old name with its new name and rename in a single call
suicide_df = suicide_df.rename(columns=dict(zip(cols_to_name, new_col_names)))
suicide_df.head()

Unnamed: 0,country,year,sex,age,num_suicides,population,suicides/100k_pop,country_year,HDI_for_year,annual_gdp,gdp_per_capita,generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


In [10]:
# Get Pandas Profiling Report
# Check Pandas Profiling version
import pandas_profiling

pandas_profiling.__version__

'2.6.0'

In [17]:
# Pandas Profiling 2.4+ API
from pandas_profiling import ProfileReport
profile = ProfileReport(suicide_df, minimal=True)

# to_notebook_iframe() displays the report and returns None, so call it on its own line
profile.to_notebook_iframe()


In [18]:
# The Pandas Profiling Report showed that annual_gdp was read as objects
# Convert it to integers

# Helper applied row by row: remove `chars` from the string stored in `column`
def char_eraser(row, column, chars):
  return row[column].replace(chars, '')

# Remove commas from the annual_gdp values, then convert them to 64-bit integers
# (the largest values exceed the 32-bit integer range)
suicide_df['annual_gdp'] = suicide_df.apply(char_eraser, axis=1, args=('annual_gdp', ',')).astype('int64')

# Check that it worked
print(suicide_df['annual_gdp'])
print(suicide_df['annual_gdp'].dtype)

0         2156624900
1         2156624900
2         2156624900
3         2156624900
4         2156624900
            ...     
27815    63067077179
27816    63067077179
27817    63067077179
27818    63067077179
27819    63067077179
Name: annual_gdp, Length: 27820, dtype: int64
int64
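
A vectorized alternative to the row-by-row `apply` above, sketched on a toy Series. The comma-formatted strings are assumed stand-ins for the raw column, and `raw_gdp` and `clean_gdp` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the raw GDP column (strings with thousands separators)
raw_gdp = pd.Series(['2,156,624,900', '63,067,077,179'])

# Strip the commas with a vectorized string method, then cast to 64-bit integers
clean_gdp = raw_gdp.str.replace(',', '', regex=False).astype('int64')

print(clean_gdp.tolist())
print(clean_gdp.dtype)
```

On a column of this size the `.str` accessor is usually much faster than `DataFrame.apply(..., axis=1)`, because it avoids building a Series object for every row.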


In [27]:
# The Pandas Profiling Report also showed high cardinality for the country_year column
# as well as a very large number of missing values for the HDI_for_year column

# Look into the country_year column
suicide_df['country_year'].value_counts(dropna=False)

Sweden2000                 12
Argentina2008              12
Slovenia2012               12
Trinidad and Tobago1998    12
Greece2014                 12
                           ..
Sweden2016                 10
Mauritius2016              10
Romania2016                10
Thailand2016               10
Hungary2016                10
Name: country_year, Length: 2321, dtype: int64
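
The values above look like `country` and `year` glued together, which would make the column redundant. A quick way to check that, sketched on a toy frame (the three rows and the names `toy`, `rebuilt`, and `is_redundant` are assumptions for illustration):

```python
import pandas as pd

# Toy rows that mimic the shape of the real columns
toy = pd.DataFrame({
    'country': ['Albania', 'Albania', 'Sweden'],
    'year': [1987, 1988, 2000],
    'country_year': ['Albania1987', 'Albania1988', 'Sweden2000'],
})

# Rebuild the combined key from its parts and compare it with the stored column
rebuilt = toy['country'] + toy['year'].astype(str)
is_redundant = bool((rebuilt == toy['country_year']).all())

print(is_redundant)
```

If the same check returns True on `suicide_df`, the column carries no information beyond `country` and `year`.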

In [28]:
# Look into HDI_for_year column
suicide_df['HDI_for_year'].value_counts(dropna=False)

NaN      19456
0.772       84
0.713       84
0.888       84
0.909       72
         ...  
0.765       12
0.522       12
0.728       12
0.879       12
0.669       12
Name: HDI_for_year, Length: 306, dtype: int64
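
The counts above show 19456 missing values out of 27820 rows, which is about 70% of the column. The share can also be computed directly; this sketch uses a toy Series, and `toy_hdi` and `missing_share` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy column with 3 missing values out of 5
toy_hdi = pd.Series([np.nan, 0.772, np.nan, np.nan, 0.888])

# isna() gives booleans; their mean is the fraction of missing values
missing_share = toy_hdi.isna().mean()

print(f"{missing_share:.0%} missing")
```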

In [29]:
# Drop the country_year and HDI_for_year columns
garbage_columns = ['country_year', 'HDI_for_year']
suicide_df = suicide_df.drop(columns=garbage_columns)
suicide_df.head()

Unnamed: 0,country,year,sex,age,num_suicides,population,suicides/100k_pop,annual_gdp,gdp_per_capita,generation
0,Albania,1987,male,15-24 years,21,312900,6.71,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,2156624900,796,Boomers


In [30]:
# Remove the ' years' string in the age column
suicide_df['age'] = suicide_df.apply(char_eraser, axis=1, args=('age', ' years'))
suicide_df.head()

Unnamed: 0,country,year,sex,age,num_suicides,population,suicides/100k_pop,annual_gdp,gdp_per_capita,generation
0,Albania,1987,male,15-24,21,312900,6.71,2156624900,796,Generation X
1,Albania,1987,male,35-54,16,308000,5.19,2156624900,796,Silent
2,Albania,1987,female,15-24,14,289700,4.83,2156624900,796,Generation X
3,Albania,1987,male,75+,1,21800,4.59,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34,9,274300,3.28,2156624900,796,Boomers


In [24]:
# Q. What is the maximum num_suicides in this dataset?
print(f"Maximum num_suicides: {suicide_df['num_suicides'].describe()['max']}")

Maximum num_suicides: 22338.0


In [25]:
# Look at the row with the max num_suicides
suicide_df[suicide_df['num_suicides'] == suicide_df['num_suicides'].max()]

Unnamed: 0,country,year,sex,age,num_suicides,population,suicides/100k_pop,country_year,HDI_for_year,annual_gdp,gdp_per_capita,generation
20996,Russian Federation,1994,male,35-54 years,22338,19044200,117.3,Russian Federation1994,,395077301248,2853,Boomers


In [33]:
# What about the minimum num_suicides?
print(f"Minimum num_suicides: {suicide_df['num_suicides'].describe()['min']}")

Minimum num_suicides: 0.0


In [34]:
# Look at the row(s) with the minimum num_suicides
suicide_df[suicide_df['num_suicides'] == suicide_df['num_suicides'].min()]

Unnamed: 0,country,year,sex,age,num_suicides,population,suicides/100k_pop,annual_gdp,gdp_per_capita,generation
9,Albania,1987,female,5-14,0,311000,0.0,2156624900,796,Generation X
10,Albania,1987,female,55-74,0,144600,0.0,2156624900,796,G.I. Generation
11,Albania,1987,male,5-14,0,338200,0.0,2156624900,796,Generation X
22,Albania,1988,female,5-14,0,317200,0.0,2126000000,769,Generation X
23,Albania,1988,male,5-14,0,345000,0.0,2126000000,769,Generation X
...,...,...,...,...,...,...,...,...,...,...
27363,Uruguay,1998,female,5-14,0,262973,0.0,25385928198,8420,Millenials
27459,Uruguay,2006,female,5-14,0,260187,0.0,19579457966,6362,Millenials
27471,Uruguay,2007,female,5-14,0,257931,0.0,23410572634,7581,Generation Z
27495,Uruguay,2009,male,5-14,0,263516,0.0,31660911277,10166,Generation Z


In [38]:
# Q. Now what is the max population? How many suicides does it have?
print(f"Maximum Population: {suicide_df['population'].describe()['max']}")

Maximum Population: 43805214.0


In [37]:
# Look at the row with the max population
suicide_df[suicide_df['population'] == suicide_df['population'].max()]

Unnamed: 0,country,year,sex,age,num_suicides,population,suicides/100k_pop,annual_gdp,gdp_per_capita,generation
27105,United States,2006,female,35-54,3376,43805214,7.71,13855888000000,49666,Boomers


In [39]:
# What is the min population? How many suicides does it have?
print(f"Minimum Population: {suicide_df['population'].describe()['min']}")

Minimum Population: 278.0


In [41]:
# Look at the row(s) with the minimum population
suicide_df[suicide_df['population'] == suicide_df['population'].min()]

Unnamed: 0,country,year,sex,age,num_suicides,population,suicides/100k_pop,annual_gdp,gdp_per_capita,generation
14059,Kiribati,1991,male,75+,0,278,0.0,47515189,768,G.I. Generation
14167,Kiribati,2000,male,75+,0,278,0.0,67254174,928,G.I. Generation


In [45]:
# Q. How often did 0 suicides occur in this data?
suicide_df[suicide_df['num_suicides'] == 0].shape, suicide_df.shape

((4281, 10), (27820, 10))

In [47]:
# Compare the mean and median population to pick a threshold for "large" populations
suicide_df['population'].mean(), suicide_df['population'].median()

(1844793.6173975556, 430150.0)

In [48]:
# Look at suicides for > mean population
suicide_df[suicide_df['population'] > suicide_df['population'].mean()]

Unnamed: 0,country,year,sex,age,num_suicides,population,suicides/100k_pop,annual_gdp,gdp_per_capita,generation
589,Argentina,1985,male,55-74,485,1997000,24.29,88416668900,3264,G.I. Generation
590,Argentina,1985,male,35-54,414,3346300,12.37,88416668900,3264,Silent
591,Argentina,1985,female,55-74,210,2304000,9.11,88416668900,3264,G.I. Generation
592,Argentina,1985,male,25-34,177,2234200,7.92,88416668900,3264,Boomers
594,Argentina,1985,male,15-24,156,2415200,6.46,88416668900,3264,Generation X
...,...,...,...,...,...,...,...,...,...,...
27812,Uzbekistan,2014,male,15-24,347,3126905,11.10,63067077179,2309,Millenials
27814,Uzbekistan,2014,female,25-34,162,2735238,5.92,63067077179,2309,Millenials
27815,Uzbekistan,2014,female,35-54,107,3620833,2.96,63067077179,2309,Generation X
27817,Uzbekistan,2014,male,5-14,60,2762158,2.17,63067077179,2309,Generation Z


In [49]:
# Look at suicides for > median population
suicide_df[suicide_df['population'] > suicide_df['population'].median()]

Unnamed: 0,country,year,sex,age,num_suicides,population,suicides/100k_pop,annual_gdp,gdp_per_capita,generation
589,Argentina,1985,male,55-74,485,1997000,24.29,88416668900,3264,G.I. Generation
590,Argentina,1985,male,35-54,414,3346300,12.37,88416668900,3264,Silent
591,Argentina,1985,female,55-74,210,2304000,9.11,88416668900,3264,G.I. Generation
592,Argentina,1985,male,25-34,177,2234200,7.92,88416668900,3264,Boomers
593,Argentina,1985,female,75+,41,537000,7.64,88416668900,3264,G.I. Generation
...,...,...,...,...,...,...,...,...,...,...
27814,Uzbekistan,2014,female,25-34,162,2735238,5.92,63067077179,2309,Millenials
27815,Uzbekistan,2014,female,35-54,107,3620833,2.96,63067077179,2309,Generation X
27817,Uzbekistan,2014,male,5-14,60,2762158,2.17,63067077179,2309,Generation Z
27818,Uzbekistan,2014,female,5-14,44,2631600,1.67,63067077179,2309,Generation Z


In [52]:
# What are the sex categories?
# How frequently does each occur?
print(suicide_df['sex'].value_counts())

# What are the generation categories?
# How frequently do they occur?
print(suicide_df['generation'].value_counts())

male      13910
female    13910
Name: sex, dtype: int64
Generation X       6408
Silent             6364
Millenials         5844
Boomers            4990
G.I. Generation    2744
Generation Z       1470
Name: generation, dtype: int64


In [55]:
# Look at annual_gdp
# Are there any zeros in annual_gdp? Are there supposed to be?
suicide_df['annual_gdp'].describe()


count    2.782000e+04
mean     4.455810e+11
std      1.453610e+12
min      4.691962e+07
25%      8.985353e+09
50%      4.811469e+10
75%      2.602024e+11
max      1.812071e+13
Name: annual_gdp, dtype: float64

In [60]:
# Look into min annual_gdp
suicide_df[suicide_df['annual_gdp'] == suicide_df['annual_gdp'].min()]

Unnamed: 0,country,year,sex,age,num_suicides,population,suicides/100k_pop,annual_gdp,gdp_per_capita,generation
14072,Kiribati,1993,male,35-54,3,6370,47.1,46919625,735,Boomers
14073,Kiribati,1993,male,15-24,3,6585,45.56,46919625,735,Generation X
14074,Kiribati,1993,male,55-74,1,2384,41.95,46919625,735,Silent
14075,Kiribati,1993,male,25-34,2,5958,33.57,46919625,735,Boomers
14076,Kiribati,1993,female,15-24,1,6579,15.2,46919625,735,Generation X
14077,Kiribati,1993,female,5-14,1,9415,10.62,46919625,735,Millenials
14078,Kiribati,1993,male,5-14,1,9780,10.22,46919625,735,Millenials
14079,Kiribati,1993,female,25-34,0,6436,0.0,46919625,735,Boomers
14080,Kiribati,1993,female,35-54,0,6730,0.0,46919625,735,Boomers
14081,Kiribati,1993,female,55-74,0,2840,0.0,46919625,735,Silent


In [61]:
# Look into max annual gdp
suicide_df[suicide_df['annual_gdp'] == suicide_df['annual_gdp'].max()]

Unnamed: 0,country,year,sex,age,num_suicides,population,suicides/100k_pop,annual_gdp,gdp_per_capita,generation
27208,United States,2015,male,75+,3171,8171136,38.81,18120714000000,60387,Silent
27209,United States,2015,male,55-74,9068,32264697,28.11,18120714000000,60387,Boomers
27210,United States,2015,male,35-54,11634,41658010,27.93,18120714000000,60387,Generation X
27211,United States,2015,male,25-34,5503,22137097,24.86,18120714000000,60387,Millenials
27212,United States,2015,male,15-24,4359,22615073,19.27,18120714000000,60387,Millenials
27213,United States,2015,female,35-54,4053,41531809,9.76,18120714000000,60387,Generation X
27214,United States,2015,female,55-74,2872,35115610,8.18,18120714000000,60387,Boomers
27215,United States,2015,female,25-34,1444,21555712,6.7,18120714000000,60387,Millenials
27216,United States,2015,female,15-24,1132,21633813,5.23,18120714000000,60387,Millenials
27217,United States,2015,female,75+,540,11778666,4.58,18120714000000,60387,Silent


In [56]:
# Look at gdp_per_capita
# Are there any zeros? Are there supposed to be?
suicide_df['gdp_per_capita'].describe()

count     27820.000000
mean      16866.464414
std       18887.576472
min         251.000000
25%        3447.000000
50%        9372.000000
75%       24874.000000
max      126352.000000
Name: gdp_per_capita, dtype: float64

In [58]:
# Look into min gdp_per_capita
suicide_df[suicide_df['gdp_per_capita'] == suicide_df['gdp_per_capita'].min()]


Unnamed: 0,country,year,sex,age,num_suicides,population,suicides/100k_pop,annual_gdp,gdp_per_capita,generation
36,Albania,1992,male,35-54,12,343800,3.49,709452584,251,Boomers
37,Albania,1992,male,15-24,9,263700,3.41,709452584,251,Generation X
38,Albania,1992,male,55-74,5,159500,3.13,709452584,251,Silent
39,Albania,1992,male,25-34,7,245500,2.85,709452584,251,Boomers
40,Albania,1992,female,15-24,7,292400,2.39,709452584,251,Generation X
41,Albania,1992,female,25-34,4,267400,1.5,709452584,251,Boomers
42,Albania,1992,female,35-54,2,323100,0.62,709452584,251,Boomers
43,Albania,1992,female,55-74,1,164900,0.61,709452584,251,Silent
44,Albania,1992,female,5-14,0,336700,0.0,709452584,251,Millenials
45,Albania,1992,female,75+,0,38700,0.0,709452584,251,G.I. Generation


In [59]:
# Look into max gdp_per_capita
suicide_df[suicide_df['gdp_per_capita'] == suicide_df['gdp_per_capita'].max()]

Unnamed: 0,country,year,sex,age,num_suicides,population,suicides/100k_pop,annual_gdp,gdp_per_capita,generation
15654,Luxembourg,2014,male,55-74,18,52295,34.42,66327344189,126352,Boomers
15655,Luxembourg,2014,male,75+,4,14546,27.5,66327344189,126352,Silent
15656,Luxembourg,2014,male,35-54,20,88218,22.67,66327344189,126352,Generation X
15657,Luxembourg,2014,male,25-34,7,41442,16.89,66327344189,126352,Millenials
15658,Luxembourg,2014,female,75+,3,22669,13.23,66327344189,126352,Silent
15659,Luxembourg,2014,female,55-74,5,52260,9.57,66327344189,126352,Boomers
15660,Luxembourg,2014,female,15-24,2,32510,6.15,66327344189,126352,Millenials
15661,Luxembourg,2014,male,15-24,2,34219,5.84,66327344189,126352,Millenials
15662,Luxembourg,2014,female,25-34,2,40862,4.89,66327344189,126352,Millenials
15663,Luxembourg,2014,female,35-54,3,84147,3.57,66327344189,126352,Generation X


In [62]:
# Keep a subset of rows:
# only those with more than 0 suicides
suicide_df_sub = suicide_df.copy()
print(suicide_df_sub.shape)
suicide_df_sub = suicide_df_sub[suicide_df_sub['num_suicides'] > 0]
# Check how many rows you have now.
suicide_df_sub.shape

(27820, 10)


(23539, 10)

In [74]:
# Make a Plotly Express scatter plot of annual_gdp vs num_suicides
import plotly.express as px

px.scatter(suicide_df, x='annual_gdp', y='num_suicides')

In [75]:
# Create the same plot using our subset of data (excluding suicide numbers equal to 0)
px.scatter(suicide_df_sub, x='annual_gdp', y='num_suicides')

In [76]:
# Make a Plotly Express scatter plot of gdp_per_capita vs num_suicides
px.scatter(suicide_df, x='gdp_per_capita', y='num_suicides')

In [77]:
# Create same plot using subset of data (excluding suicide numbers equal to 0)
px.scatter(suicide_df_sub, x='gdp_per_capita', y='num_suicides')

In [0]:
# Add an OLS (Ordinary Least Squares) trendline,
# to see how the outliers influence the "line of best fit"
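
One way to see the effect before plotting, sketched on toy numbers with NumPy least squares. In the notebook itself, `px.scatter(..., trendline='ols')` draws the fitted line in Plotly Express and needs `statsmodels` installed. The toy values and the names `slope_all` and `slope_trimmed` are assumptions for illustration.

```python
import numpy as np

# Toy data: y is roughly 2 * x, plus one extreme point at the end
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9, 60.0])

# Least-squares slope with every point included
slope_all, _ = np.polyfit(x, y, deg=1)

# Least-squares slope after dropping the extreme last point
slope_trimmed, _ = np.polyfit(x[:-1], y[:-1], deg=1)

print(f"Slope with outlier:    {slope_all:.2f}")
print(f"Slope without outlier: {slope_trimmed:.2f}")
```

A single extreme point pulls the fitted slope far from the trend of the remaining points, which is the same distortion the trendline makes visible on the scatter plot.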


In [0]:
# Look at some of the top suicide numbers
# Where do they occur?

# Look at some of the lowest suicide numbers
# Where do they occur?
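
A possible starting point for this cell, sketched on a toy frame: `nlargest` and `nsmallest` return the rows with the highest and lowest values of a column. The toy rows and the names `toy`, `top`, and `bottom` are hypothetical.

```python
import pandas as pd

# Toy rows; country labels are placeholders
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'num_suicides': [21, 22338, 0, 485],
})

# Rows with the two highest and the two lowest counts
top = toy.nlargest(2, 'num_suicides')
bottom = toy.nsmallest(2, 'num_suicides')

print(top)
print(bottom)
```

Applied to `suicide_df`, the `country`, `year`, `sex`, and `age` columns of the returned rows show where the extremes occur.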

In [0]:
# Make a judgment call:
# Are there outliers? 
# Should they be removed?
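
If the judgment call is to trim extreme values, one common approach is a percentile cutoff. The sketch below uses toy numbers, and the 95th-percentile threshold is an arbitrary assumption rather than a recommendation for this dataset; `toy`, `upper`, and `trimmed` are hypothetical names.

```python
import pandas as pd

# Toy counts with one extreme value
toy = pd.DataFrame({'num_suicides': [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]})

# Upper cutoff at the 95th percentile (assumed threshold)
upper = toy['num_suicides'].quantile(0.95)

# Keep only the rows at or below the cutoff
trimmed = toy[toy['num_suicides'] <= upper]

print(f"Cutoff: {upper:.2f}")
print(f"Rows kept: {len(trimmed)} of {len(toy)}")
```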

In [0]:
# Now that you've removed outliers,
# Look again at a scatter plot with OLS (Ordinary Least Squares) trendline


In [0]:
# Select these columns, then write to a csv file named suicides.csv. Don't include the index.
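
A minimal sketch of this step on a toy frame. The comment above does not say which columns to keep, so `columns_to_keep` is a hypothetical choice, and the file is written to a temporary directory here rather than next to the notebook.

```python
import os
import tempfile

import pandas as pd

# Toy rows shaped like the cleaned data
toy = pd.DataFrame({
    'country': ['Albania', 'Albania'],
    'year': [1987, 1987],
    'num_suicides': [21, 16],
    'generation': ['Generation X', 'Silent'],
})

# Hypothetical column selection
columns_to_keep = ['country', 'year', 'num_suicides']

# index=False leaves the row index out of the file
out_path = os.path.join(tempfile.mkdtemp(), 'suicides.csv')
toy[columns_to_keep].to_csv(out_path, index=False)

# Read the file back to confirm what was written
round_trip = pd.read_csv(out_path)
print(round_trip)
```

Without `index=False`, reading the file back produces an extra `Unnamed: 0` column like the one visible in the tables above.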
