
Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone can be a reasonable metric. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
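
For the train/validate/test task above, a time-based split could be sketched roughly as follows. The toy frame, the `target` column, and the cutoff years are all assumptions made for illustration; the right cutoffs depend on your own dataset.

```python
import pandas as pd

# Toy data standing in for a project dataset (hypothetical values)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2015-03-01', '2016-07-15', '2017-01-20',
                            '2018-05-05', '2019-02-11', '2019-09-30']),
    'target': [0, 1, 0, 1, 1, 0],
})

# Train on the earliest years, validate on the next, test on the most recent,
# so the model is never evaluated on data older than what it was trained on
train = toy[toy['date'].dt.year <= 2016]
val = toy[toy['date'].dt.year.between(2017, 2018)]
test = toy[toy['date'].dt.year >= 2019]

print(len(train), len(val), len(test))
```

A random split is simpler, but a time-based split is usually the safer choice when the model will be used to predict future observations.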

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [145]:
!wget 'https://raw.githubusercontent.com/washingtonpost/data-school-shootings/master/school-shootings-data.csv'

--2020-02-25 01:21:06--  https://raw.githubusercontent.com/washingtonpost/data-school-shootings/master/school-shootings-data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 70530 (69K) [text/plain]
Saving to: ‘school-shootings-data.csv.3’


2020-02-25 01:21:06 (3.02 MB/s) - ‘school-shootings-data.csv.3’ saved [70530/70530]



In [198]:
import pandas as pd

df = pd.read_csv('school-shootings-data.csv')
df.head()

Unnamed: 0,uid,nces_school_id,school_name,nces_district_id,district_name,date,school_year,year,time,day_of_week,city,state,school_type,enrollment,killed,injured,casualties,shooting_type,age_shooter1,gender_shooter1,race_ethnicity_shooter1,shooter_relationship1,shooter_deceased1,deceased_notes1,age_shooter2,gender_shooter2,race_ethnicity_shooter2,shooter_relationship2,shooter_deceased2,deceased_notes2,white,black,hispanic,asian,american_indian_alaska_native,hawaiian_native_pacific_islander,two_or_more,resource_officer,weapon,weapon_source,lat,long,staffing,low_grade,high_grade,lunch,county,state_fips,county_fips,ulocale
0,1,80480000707,Columbine High School,804800.0,Jefferson County R-1,4/20/1999,1998-1999,1999,11:19 AM,Tuesday,Littleton,Colorado,public,1965,13,21,34,indiscriminate,18.0,m,w,student,1.0,suicide,17.0,m,w,student,1.0,suicide,1783,16.0,112.0,42.0,12.0,,,1,12-gauge Savage-Springfield 67H pump-action sh...,purchased from friends,39.60391,-105.075,89.6,9,12,41.0,Jefferson County,8,8059,21.0
1,2,220054000422,Scotlandville Middle School,2200540.0,East Baton Rouge Parish School Board,4/22/1999,1998-1999,1999,12:30 PM,Thursday,Baton Rouge,Louisiana,public,588,0,1,1,targeted,14.0,m,,former student (expelled),0.0,,,,,,,,5,583.0,0.0,0.0,0.0,,,0,.22-caliber handgun,,30.529958,-91.169966,39.0,6,8,495.0,East Baton Rouge Parish,22,22033,12.0
2,3,130441001591,Heritage High School,1304410.0,Rockdale County,5/20/1999,1998-1999,1999,8:03 AM,Thursday,Conyers,Georgia,public,1369,0,6,6,indiscriminate,15.0,m,w,student,0.0,,,,,,,,1189,136.0,28.0,15.0,1.0,,,1,".22-caliber rifle, .357-caliber Magnum handgun",,33.626922,-84.04796,84.0,9,12,125.0,Rockdale County,13,13247,21.0
3,4,421899003847,John Bartram High School,4218990.0,Philadelphia City SD,10/4/1999,1999-2000,1999,10:00 AM,Monday,Philadelphia,Pennsylvania,public,3147,0,1,1,targeted,17.0,m,,student,0.0,,,,,,,,209,2736.0,27.0,170.0,5.0,,,1,RG .25-caliber handgun,purchased from friend,39.921509,-75.234108,41.0,9,12,2007.0,Philadelphia County,42,42101,11.0
4,5,250279000225,Dorchester High School,2502790.0,Boston,11/3/1999,1999-2000,1999,7:40 AM,Wednesday,Boston,Massachusetts,public,1116,0,1,1,targeted,,m,,,0.0,,,,,,,,40,755.0,287.0,29.0,5.0,,,0,,,42.285268,-71.075901,,9,12,543.0,Suffolk County,25,25025,11.0


In [0]:
# Replace shooting type with 'other' for rows not 'targeted' or 'indiscriminate'
df['shooting_type'] = df['shooting_type'].replace(['accidental', 'unclear',
                                                   'targeted and indiscriminate',
                                                   'public suicide',
                                                   'hostage suicide',
                                                   'accidental or targeted',
                                                   'public suicide (attempted)'],
                                                  'other')
# Fill missing values with 'other'
df['shooting_type'] = df['shooting_type'].fillna('other')

In [200]:
# Majority class baseline: about 59% ('targeted')
df['shooting_type'].value_counts(normalize=True)

targeted          0.588235
other             0.222689
indiscriminate    0.189076
Name: shooting_type, dtype: float64
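
As a rough check of the 50% to 70% guideline from the assignment, the majority class baseline can be computed directly. This is a minimal sketch: the series `y` is rebuilt from the proportions above, assuming 238 rows (140 targeted, 53 other, 45 indiscriminate), rather than read from the real file.

```python
import pandas as pd

# Class counts reconstructed from the proportions above (assumes 238 rows)
y = pd.Series(['targeted'] * 140 + ['other'] * 53 + ['indiscriminate'] * 45)

# Always predicting the most frequent class gives the baseline accuracy
majority_class = y.mode()[0]
baseline_accuracy = (y == majority_class).mean()

# Guideline from the assignment: accuracy is reasonable when the
# majority class frequency is at least 50% and below 70%
in_range = 0.50 <= baseline_accuracy < 0.70

print(majority_class, round(baseline_accuracy, 3), in_range)
```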

In [0]:
# Remove commas from numbers
df['white'] = df['white'].str.replace(",", "")

# Convert from object to numeric (float64 here, because the column has missing values)
df['white'] = pd.to_numeric(df['white'])

In [0]:
# Remove commas from numbers
df['enrollment'] = df['enrollment'].str.replace(",", "")

# Change from object to int
df['enrollment'] = pd.to_numeric(df['enrollment'])
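
The two cells above repeat the same cleaning steps, so they could be folded into one helper. This is only a sketch: `strip_commas_to_numeric` is a hypothetical name, and the toy values are made up to mimic comma-formatted counts.

```python
import pandas as pd

def strip_commas_to_numeric(series):
    """Remove thousands separators from a string column and convert to numbers."""
    return pd.to_numeric(series.str.replace(',', '', regex=False))

# Toy frame with comma-formatted strings and one missing value (hypothetical)
toy = pd.DataFrame({'white': ['1,783', '5', None],
                    'enrollment': ['1,965', '588', '1,369']})

for col in ['white', 'enrollment']:
    toy[col] = strip_commas_to_numeric(toy[col])

print(toy.dtypes)
```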

In [0]:
# The long way
# df['white'] = df['white'].fillna(0)
# df['black'] = df['black'].fillna(0)
# df['hispanic'] = df['hispanic'].fillna(0)
# df['asian'] = df['asian'].fillna(0)
# df['american_indian_alaska_native'] = df['american_indian_alaska_native'].fillna(0)
# df['hawaiian_native_pacific_islander'] = df['hawaiian_native_pacific_islander'].fillna(0)
# df['two_or_more'] = df['two_or_more'].fillna(0)

# Fill missing values with 0 for these specific columns
df.fillna({'white': 0, 'black': 0, 'hispanic': 0, 'asian': 0,
           'american_indian_alaska_native': 0,
           'hawaiian_native_pacific_islander': 0, 'two_or_more': 0}, inplace=True)
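
The fill dictionary can also be built from a list of column names, which avoids repeating `0` for each one. A minimal sketch on a toy frame; the shortened column list and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: two demographic columns plus one column that should stay untouched
toy = pd.DataFrame({'white': [1783.0, np.nan],
                    'black': [16.0, 583.0],
                    'staffing': [89.6, np.nan]})

demographic_cols = ['white', 'black']

# Fill missing values with 0 only in the listed columns
toy = toy.fillna({col: 0 for col in demographic_cols})

print(toy)
```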

In [0]:
# Drop columns with 200+ missing values
df = df.drop(columns=['deceased_notes1', 'age_shooter2', 'gender_shooter2',
                      'race_ethnicity_shooter2', 'shooter_relationship2',
                      'shooter_deceased2', 'deceased_notes2'])

# Drop columns that won't be used as features (IDs, codes, free text, shooter details, etc.)
df = df.drop(columns=['uid', 'nces_school_id', 'nces_district_id', 'weapon',
                      'weapon_source', 'state_fips', 'county_fips', 'ulocale',
                      'lunch', 'age_shooter1', 'gender_shooter1',
                      'race_ethnicity_shooter1', 'shooter_relationship1',
                      'shooter_deceased1'])
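
Rather than typing out the sparse columns by hand, they could be selected with a missing-value threshold. This is a sketch on toy data: the cutoff of 3 is a hypothetical stand-in for the 200 used above on 238 rows.

```python
import numpy as np
import pandas as pd

# Toy frame where one column is mostly missing (hypothetical values)
toy = pd.DataFrame({
    'school_name': ['A', 'B', 'C', 'D'],
    'age_shooter2': [np.nan, np.nan, np.nan, 17.0],
    'killed': [0, 1, 0, 2],
})

threshold = 3  # hypothetical cutoff for this 4-row toy frame

# Count missing values per column and keep the names at or above the cutoff
missing_counts = toy.isnull().sum()
sparse_cols = missing_counts[missing_counts >= threshold].index.tolist()

toy = toy.drop(columns=sparse_cols)
print(sparse_cols)
```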

In [0]:
# Convert date and time to datetime
df['date'] = pd.to_datetime(df['date'])

df['time'] = pd.to_datetime(df['time'])

# Keep only the time of day (drop the date component)
df['time'] = df['time'].dt.time
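
Passing an explicit `format` to `pd.to_datetime` avoids relying on format inference. The sketch below assumes the raw values look like `11:19 AM`, as in the `df.head()` output; the toy series is made up, and `errors='coerce'` turns anything unparseable into `NaT`.

```python
import pandas as pd

# Toy times in the same style as the raw column, with one missing value
times = pd.Series(['11:19 AM', '12:30 PM', '8:03 AM', None])

# %I = 12-hour clock, %M = minutes, %p = AM/PM
parsed = pd.to_datetime(times, format='%I:%M %p', errors='coerce')

# Time of day only, plus the hour as a numeric value that a model could use
time_only = parsed.dt.time
hour = parsed.dt.hour

print(hour)
```

A numeric column such as `hour` is often easier for a model to use than `datetime.time` objects, which pandas stores with the `object` dtype.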

In [212]:
print(df.shape)
df.head()

(238, 29)


Unnamed: 0,school_name,district_name,date,school_year,year,time,day_of_week,city,state,school_type,enrollment,killed,injured,casualties,shooting_type,white,black,hispanic,asian,american_indian_alaska_native,hawaiian_native_pacific_islander,two_or_more,resource_officer,lat,long,staffing,low_grade,high_grade,county
0,Columbine High School,Jefferson County R-1,1999-04-20,1998-1999,1999,11:19:00,Tuesday,Littleton,Colorado,public,1965,13,21,34,indiscriminate,1783.0,16.0,112.0,42.0,12.0,0.0,0.0,1,39.60391,-105.075,89.6,9,12,Jefferson County
1,Scotlandville Middle School,East Baton Rouge Parish School Board,1999-04-22,1998-1999,1999,12:30:00,Thursday,Baton Rouge,Louisiana,public,588,0,1,1,targeted,5.0,583.0,0.0,0.0,0.0,0.0,0.0,0,30.529958,-91.169966,39.0,6,8,East Baton Rouge Parish
2,Heritage High School,Rockdale County,1999-05-20,1998-1999,1999,08:03:00,Thursday,Conyers,Georgia,public,1369,0,6,6,indiscriminate,1189.0,136.0,28.0,15.0,1.0,0.0,0.0,1,33.626922,-84.04796,84.0,9,12,Rockdale County
3,John Bartram High School,Philadelphia City SD,1999-10-04,1999-2000,1999,10:00:00,Monday,Philadelphia,Pennsylvania,public,3147,0,1,1,targeted,209.0,2736.0,27.0,170.0,5.0,0.0,0.0,1,39.921509,-75.234108,41.0,9,12,Philadelphia County
4,Dorchester High School,Boston,1999-11-03,1999-2000,1999,07:40:00,Wednesday,Boston,Massachusetts,public,1116,0,1,1,targeted,40.0,755.0,287.0,29.0,5.0,0.0,0.0,0,42.285268,-71.075901,,9,12,Suffolk County


In [208]:
df.isnull().sum()

school_name                          0
district_name                       12
date                                 0
school_year                          0
year                                 0
time                                11
day_of_week                          0
city                                 0
state                                0
school_type                          0
enrollment                           0
killed                               0
injured                              0
casualties                           0
shooting_type                        0
white                                0
black                                0
hispanic                             0
asian                                0
american_indian_alaska_native        0
hawaiian_native_pacific_islander     0
two_or_more                          0
resource_officer                     0
lat                                  1
long                                 1
staffing                 

In [209]:
df.dtypes

school_name                                 object
district_name                               object
date                                datetime64[ns]
school_year                                 object
year                                         int64
time                                        object
day_of_week                                 object
city                                        object
state                                       object
school_type                                 object
enrollment                                   int64
killed                                       int64
injured                                      int64
casualties                                   int64
shooting_type                               object
white                                      float64
black                                      float64
hispanic                                   float64
asian                                      float64
american_indian_alaska_native  

In [210]:
df.describe(exclude='number').T

Unnamed: 0,count,unique,top,freq,first,last
school_name,238,231,Central High School,3,NaT,NaT
district_name,226,195,Los Angeles Unified,6,NaT,NaT
date,238,236,2014-09-30 00:00:00,2,1999-04-20,2019-04-03
school_year,238,21,2017-2018,25,NaT,NaT
time,227,103,12:00:00,15,NaT,NaT
day_of_week,238,5,Tuesday,55,NaT,NaT
city,238,196,Philadelphia,5,NaT,NaT
state,238,44,California,28,NaT,NaT
school_type,238,2,public,226,NaT,NaT
shooting_type,238,3,targeted,140,NaT,NaT


In [195]:
df['low_grade'].value_counts()

9     135
6      27
KG     24
PK     19
7      16
8      10
5       2
4       1
3       1
11      1
Name: low_grade, dtype: int64
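
Because `low_grade` mixes numeric grades with the codes `KG` and `PK`, it cannot be converted to a number directly. One possible approach is an ordinal mapping. The encoding below (`PK` as -1, `KG` as 0) and the helper name `grade_to_number` are assumptions for illustration, shown on a toy series.

```python
import pandas as pd

# Toy values in the same style as the value_counts output above
low_grade = pd.Series(['9', '6', 'KG', 'PK', '11'])

# Assumed encoding: pre-kindergarten = -1, kindergarten = 0
grade_map = {'PK': -1, 'KG': 0}

def grade_to_number(grade):
    """Map a grade label to an ordinal number, passing missing values through."""
    if pd.isna(grade):
        return float('nan')
    if grade in grade_map:
        return grade_map[grade]
    return int(grade)

low_grade_num = low_grade.map(grade_to_number)
print(low_grade_num.tolist())
```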