Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [x] Choose your target. Which column in your tabular dataset will you predict?
- [x] Is your problem regression or classification?
- [x] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [x] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [x] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [x] Begin to clean and explore your data.
- [x] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

```python
# Select and import the raw data into a dataframe
import pandas as pd
import numpy as np

# Import data (tab-separated text file)
df_work = pd.read_csv(
    "/Users/danoand/Documents/Companies/LambdaSchool/Build_Project_02/data/pedestrian-crashes-chapel-hill-region_imported.txt",
    sep="\t",
)

df_work.sample(20)
```


```
Unnamed: 0,geo_point_2d,Ambulance,City,County,Alcohol Present,Day of Week,CrashGrp,CrashHour,CrashLoc,CrashMonth,...,RdConfig,RdDefects,RdFeature,RdSurface,Region,RuralUrban,SpeedLimit,TraffCntrl,Weather,Workzone
19154,"35.9893707668,-78.3448094397",Yes,None - Rural Crash,Franklin,No,Tuesday,Walking Along Roadway,9,Non-Intersection,September,...,"Two-Way, Not Divided",,No Special Feature,Smooth Asphalt,Piedmont,Rural,40 - 45 MPH,"Double Yellow Line, No Passing Zone",Clear,No
22821,"36.1338066156,-79.4142513504",Yes,None - Rural Crash,Alamance,No,Tuesday,Walking Along Roadway,11,Non-Intersection,May,...,"Two-Way, Not Divided",,No Special Feature,Smooth Asphalt,Piedmont,Rural,30 - 35 MPH,"Double Yellow Line, No Passing Zone",Cloudy,No
31689,"35.4085697892,-80.8627580695",Yes,Huntersville,Mecklenburg,No,Thursday,Other/Unknown—Insufficient Details,12,Non-Intersection,September,...,"Two-Way, Not Divided",,No Special Feature,Coarse Asphalt,Piedmont,Urban,5 - 15 MPH,No Control Present,Cloudy,No
6577,"34.9321022171,-79.7671280096",Yes,Rockingham,Richmond,No,Thursday,Backing Vehicle,10,Non-Roadway,November,...,"Two-Way, Not Divided",,"Driveway, Public",Coarse Asphalt,Piedmont,Urban,Unknown,No Control Present,Cloudy,No
14345,"35.3215453273,-82.4674589497",No,Hendersonville,Henderson,No,Wednesday,Unusual Circumstances,9,Non-Roadway,June,...,"Two-Way, Not Divided",,Missing,Coarse Asphalt,Mountains,Urban,5 - 15 MPH,No Control Present,Clear,No
25165,"35.9639699998,-78.3072",Yes,None - Rural Crash,Franklin,Yes,Sunday,Pedestrian in Roadway—Circumstances Unknown,20,Non-Intersection,October,...,"Two-Way, Not Divided",,No Special Feature,Coarse Asphalt,Piedmont,Rural,50 - 55 MPH,No Control Present,Clear,No
4980,"35.2295432357,-80.9247151697",Yes,Charlotte,Mecklenburg,Yes,Sunday,Crossing Roadway—Vehicle Not Turning,4,Intersection,March,...,"Two-Way, Divided, Positive Median Barrier",,Four-Way Intersection,Smooth Asphalt,Piedmont,Urban,50 - 55 MPH,Stop And Go Signal,Clear,No
1229,"35.5289200002,-82.9418",Yes,None - Rural Crash,Haywood,No,Tuesday,Walking Along Roadway,5,Non-Intersection,July,...,"Two-Way, Not Divided",,No Special Feature,Coarse Asphalt,Mountains,Rural,40 - 45 MPH,"Double Yellow Line, No Passing Zone",Clear,No
31721,"34.2460800003,-77.8540999997",No,Wilmington,New Hanover,No,Thursday,Unusual Circumstances,18,Non-Intersection,April,...,"Two-Way, Divided, Unprotected Median",,No Special Feature,Coarse Asphalt,Coastal,Urban,Unknown,No Control Present,Clear,No
23679,"35.0620500003,-80.6357999998",No,Indian Trail,Union,No,Wednesday,Backing Vehicle,6,Non-Roadway,March,...,"Two-Way, Not Divided",,"Driveway, Public",Smooth Asphalt,Piedmont,Urban,Unknown,No Control Present,"Fog, Smog, Smoke",No
```


### Choose Target

_Which column in your tabular dataset will you predict?_

From a business/policy perspective, I will predict pedestrian injury severity. Specifically, the target is an engineered binary column derived from the `PedInjury` attribute that indicates whether the crash resulted in a fatality or a serious injury.
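
One way the engineered target could be built is sketched below. This is only an illustration under assumptions: the column name `SeriousOrFatal` and the helper `add_severe_target` are hypothetical, and the demo rows are made up rather than taken from the crash dataset. The two class labels are the `PedInjury` values this notebook treats as the positive outcome.

```python
import pandas as pd

# The two PedInjury classes treated as the positive outcome.
SEVERE_CLASSES = ["A: Suspected Serious Injury", "K: Killed"]


def add_severe_target(df, source_col="PedInjury", target_col="SeriousOrFatal"):
    """Return a copy of df with a 0/1 column flagging serious or fatal injuries.

    `target_col` is a hypothetical name chosen for this sketch.
    """
    out = df.copy()
    out[target_col] = out[source_col].isin(SEVERE_CLASSES).astype(int)
    return out


# Tiny illustrative frame (made-up rows, not the crash dataset).
demo = pd.DataFrame(
    {
        "PedInjury": [
            "C: Possible Injury",
            "K: Killed",
            "A: Suspected Serious Injury",
            "O: No Injury",
        ]
    }
)
demo = add_severe_target(demo)
print(demo)
```

With the real data, the same helper would be applied to `df_work`. Rows labelled `Unknown Injury` fall into the negative class here, which is a choice worth revisiting.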



### Type of Problem

_Is your problem regression or classification?_ 

`Classification`

### Target Distribution

_How many classes? Are the classes imbalanced?_

`PedInjury` has six classes. The engineered target column combines two of them (***A: Suspected Serious Injury*** and ***K: Killed***) into a single positive outcome.

```
C: Possible Injury             0.409020
B: Suspected Minor Injury      0.354597
A: Suspected Serious Injury    0.072140
K: Killed                      0.063917
O: No Injury                   0.059975
Unknown Injury                 0.040351
Name: PedInjury, dtype: float64
```

In this training dataset, approximately 14% of pedestrians involved in vehicle/pedestrian accidents were killed or suffered a suspected serious injury (0.072 + 0.064 ≈ 0.136).
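
As a quick check on the size of the positive class, the sketch below re-enters the class proportions from the output above and sums the two severe classes. The variable names are illustrative. On the full dataframe, the proportions would presumably come from something like `df_work["PedInjury"].value_counts(normalize=True)`, which is an assumption about how the output was produced.

```python
import pandas as pd

# Class proportions copied from the PedInjury output shown above.
ped_injury_share = pd.Series(
    {
        "C: Possible Injury": 0.409020,
        "B: Suspected Minor Injury": 0.354597,
        "A: Suspected Serious Injury": 0.072140,
        "K: Killed": 0.063917,
        "O: No Injury": 0.059975,
        "Unknown Injury": 0.040351,
    },
    name="PedInjury",
)

# Combined share of the two classes that make up the positive outcome.
severe_share = ped_injury_share[
    ["A: Suspected Serious Injury", "K: Killed"]
].sum()

print(f"Serious or fatal share: {severe_share:.1%}")  # 13.6%
```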


### Evaluation Metrics

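I will use accuracy, precision, and recall to get an overall view of the model's performance. Precision and recall matter here because the positive class is a small minority of the observations, so accuracy alone could be misleading.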
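
To make the three metrics concrete, here is a minimal sketch that computes them by hand from confusion counts. The helper `binary_metrics` and the toy labels are assumptions for illustration only: 14 positives in 100 rows, close to the combined A and K share in the class proportions. It scores a baseline that always predicts the majority (negative) class.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Report 0.0 when a denominator is zero (no predicted or actual positives).
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels: 14 serious-or-fatal outcomes out of 100 crashes.
y_true = [1] * 14 + [0] * 86

# Baseline that never predicts a serious or fatal outcome.
y_baseline = [0] * 100

baseline_metrics = binary_metrics(y_true, y_baseline)
print(baseline_metrics)
```

The baseline reaches high accuracy while finding none of the severe cases (recall of zero), which is why accuracy is paired with the other two metrics. In scikit-learn, `accuracy_score`, `precision_score`, and `recall_score` from `sklearn.metrics` compute the same quantities.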


### Observation Choices

_Choose which observations you will use to train, validate, and test your model._

#### Dataset Assessment

The overall dataset appears very "clean": no missing data or outliers were identified.

* For the model attributes (a subset of all the dataset attributes), the profile flags some high-cardinality columns and zero values. The zeros align with real-world observations (e.g. a `CrashHour` of 0 is assumed to mean the midnight to 1 am hour).

**Pandas profile assessment**:

* `City` has a high cardinality: 444 distinct values	(Warning)
* `County` has a high cardinality: 100 distinct values	(Warning)
* `CrashHour` has 716 (2.1%) zeros	(Zeros)
* `Region` is highly correlated with `County`	(High Correlation)
* `County` is highly correlated with `Region`	(High Correlation)
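
The cardinality and zero-count warnings above can also be reproduced with plain pandas, without the profiling report. The sketch below is an assumption about how one might do that: `profile_summary` is a hypothetical helper, and the demo frame is a handful of made-up rows rather than the real data.

```python
import pandas as pd


def profile_summary(df, zero_col="CrashHour"):
    """Return per-column distinct counts plus the zero count and share for one column."""
    cardinality = df.nunique().sort_values(ascending=False)
    zero_count = int((df[zero_col] == 0).sum())
    zero_share = zero_count / len(df)
    return cardinality, zero_count, zero_share


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {
        "City": ["Charlotte", "Wilmington", "Rockingham", "Charlotte"],
        "County": ["Mecklenburg", "New Hanover", "Richmond", "Mecklenburg"],
        "CrashHour": [0, 18, 10, 4],
    }
)

cardinality, zero_count, zero_share = profile_summary(demo)
print(cardinality)
print(f"CrashHour zeros: {zero_count} ({zero_share:.1%})")
```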

#### Modeling Split

My intent is to split the data randomly but "stratify" on the original (pre-engineered) `PedInjury` outcomes, so that each split keeps roughly the same mix of injury classes. The split will be as follows:

* Training (`70%`)
* Validation (`15%`)
* Test (`15%`)
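
A stratified 70/15/15 split like this can be sketched with two calls to scikit-learn's `train_test_split`: first hold out 30%, then halve the holdout. The helper name `split_70_15_15`, the seed, and the synthetic frame are assumptions for illustration. The synthetic class counts only mimic the proportions reported earlier.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_70_15_15(df, stratify_col="PedInjury", seed=42):
    """Randomly split df into train/validation/test, stratified on stratify_col."""
    # Step 1: 70% train, 30% holdout.
    train, holdout = train_test_split(
        df, test_size=0.30, stratify=df[stratify_col], random_state=seed
    )
    # Step 2: split the holdout in half: 15% validation, 15% test.
    val, test = train_test_split(
        holdout, test_size=0.50, stratify=holdout[stratify_col], random_state=seed
    )
    return train, val, test


# Synthetic frame with 1,000 rows whose class mix mimics the reported shares.
counts = {
    "C: Possible Injury": 409,
    "B: Suspected Minor Injury": 355,
    "A: Suspected Serious Injury": 72,
    "K: Killed": 64,
    "O: No Injury": 60,
    "Unknown Injury": 40,
}
demo = pd.DataFrame(
    {"PedInjury": [label for label, n in counts.items() for _ in range(n)]}
)

train, val, test = split_70_15_15(demo)
print(len(train), len(val), len(test))
```

Stratifying on the six original classes rather than on the binary target keeps the mix of A and K cases similar across splits as well.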

#### Modeling Attributes

The modeling attributes are "a priori" or "ex ante" data elements: attributes that could reasonably be known before the event, as opposed to attributes known only after it (driver age, sex, etc.). Those "ex ante" data elements are:

* `City`
* `County`
* `CrashHour`
* `CrashMonth`
* `Development`
* `LightCond`
* `Locality`
* `NumLanes`
* `RdCharacte`
* `RdClass`
* `RdConditio`
* `RdConfig`
* `RdDefects`
* `RdFeature`
* `RdSurface`
* `Region`
* `RuralUrban`
* `SpeedLimit`
* `TraffCntrl`
* `Weather`
* `Workzone`
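
Restricting the dataframe to these attributes could look like the sketch below. The list is copied from the bullets above. The helper `select_features`, the constant name `ANTE_FEATURES`, and the placeholder demo frame (including the hypothetical `DriverAge` column) are assumptions for illustration.

```python
import pandas as pd

# Attributes that could reasonably be known before a crash occurs.
ANTE_FEATURES = [
    "City", "County", "CrashHour", "CrashMonth", "Development", "LightCond",
    "Locality", "NumLanes", "RdCharacte", "RdClass", "RdConditio", "RdConfig",
    "RdDefects", "RdFeature", "RdSurface", "Region", "RuralUrban",
    "SpeedLimit", "TraffCntrl", "Weather", "Workzone",
]


def select_features(df, features=ANTE_FEATURES):
    """Return only the modeling columns, failing loudly if any are absent."""
    missing = [col for col in features if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in dataframe: {missing}")
    return df[features]


# Placeholder one-row frame: the modeling columns plus the target source
# and a hypothetical post-event column that should be left out.
demo = pd.DataFrame(
    {col: ["placeholder"] for col in ANTE_FEATURES + ["PedInjury", "DriverAge"]}
)

X = select_features(demo)
print(X.shape)
```

Keeping the list in one place makes it harder for a post-event column to leak into the feature matrix.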
