<a href="https://colab.research.google.com/github/Dlinz1/DS-Unit-1-Sprint-1-Data-Wrangling-and-Storytelling/blob/master/DlinzPT9Make_Features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 1, Sprint 1, Module 2*

---

# Make Features 

- Student should be able to understand the purpose of feature engineering
- Student should be able to work with strings in pandas
- Student should be able to work with dates and times in pandas
- Student should be able to modify or create columns of a dataframe using the `.apply()` function


Helpful Links:
- [Minimally Sufficient Pandas](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428)
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series
- [Lambda Learning Method for DS - By Ryan Herr](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit?usp=sharing)

# [Objective](#feature-engineering) - The Purpose of Feature Engineering



## Overview

Feature Engineering is the process of using a combination of domain knowledge, creativity and the pre-existing columns of a dataset to create completely new columns.

Machine learning models try to detect patterns in the data and then associate those patterns with certain predictions. The hope is that by creating new columns on our dataset, we can expose our model to new patterns in the data so that it can make better predictions.

This is largely a matter of understanding how to work with individual columns of a dataframe with Pandas, which is what we'll be practicing today!

## Follow Along

Each column of a dataframe holds a specific type of data. Let's inspect some of the common datatypes found in datasets and then make a new feature on a dataset using pre-existing columns.

In [None]:
import pandas as pd  # imported here so this cell runs on its own

# How to change the order of columns
frame = pd.DataFrame({'one thing':[1,2,3,4],'second thing':[0.1,0.2,1,2],'other thing':['a','e','i','o']})

frame

frame = frame[['second thing', 'other thing', 'one thing']]

frame


Unnamed: 0,second thing,other thing,one thing
0,0.1,a,1
1,0.2,e,2
2,1.0,i,3
3,2.0,o,4


In [None]:
import pandas as pd
import numpy as np

1. X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
2. Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
3. month - month of the year: 'jan' to 'dec'
4. day - day of the week: 'mon' to 'sun'
5. FFMC - FFMC index from the FWI system: 18.7 to 96.20
6. DMC - DMC index from the FWI system: 1.1 to 291.3
7. DC - DC index from the FWI system: 7.9 to 860.6
8. ISI - ISI index from the FWI system: 0.0 to 56.10
9. temp - temperature in Celsius degrees: 2.2 to 33.30
10. RH - relative humidity in %: 15.0 to 100
11. wind - wind speed in km/h: 0.40 to 9.40
12. rain - outside rain in mm/m2 : 0.0 to 6.4
13. area - the burned area of the forest (in ha): 0.00 to 1090.84

Key:
- FFMC - Fine Fuel Moisture Code
- DMC - The Duff Moisture Code represents fuel moisture of decomposed organic material underneath the litter
- DC - The Drought Code represents the amount of dryness deep into the soil
- ISI - The Initial Spread Index integrates fuel moisture for fine dead fuels and surface windspeed to estimate a spread potential

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv

--2020-09-04 19:33:21--  https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25478 (25K) [application/x-httpd-php]
Saving to: ‘forestfires.csv’


2020-09-04 19:33:22 (471 KB/s) - ‘forestfires.csv’ saved [25478/25478]



In [None]:
FFiresData = pd.read_csv("forestfires.csv")

### Specific Columns hold specific kinds of data

In [None]:
FFiresData.head()
FFiresData.tail()

FFiresData


Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.00
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.00
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.00
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.00
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
512,4,3,aug,sun,81.6,56.7,665.6,1.9,27.8,32,2.7,0.0,6.44
513,2,4,aug,sun,81.6,56.7,665.6,1.9,21.9,71,5.8,0.0,54.29
514,7,4,aug,sun,81.6,56.7,665.6,1.9,21.2,70,6.7,0.0,11.16
515,1,4,aug,sat,94.4,146.0,614.7,11.3,25.6,42,4.0,0.0,0.00


In [None]:
FFiresData.dtypes

X          int64
Y          int64
month     object
day       object
FFMC     float64
DMC      float64
DC       float64
ISI      float64
temp     float64
RH         int64
wind     float64
rain     float64
area     float64
dtype: object

Some columns hold integer values, like `RH`, the relative humidity in percent.

For more information on specific column meanings, see the attribute list near the top of this notebook.

In [None]:
# first ten rows

FFiresData['RH'].head(10)




0    51
1    33
2    33
3    97
4    99
5    29
6    27
7    86
8    63
9    40
Name: RH, dtype: int64

In [None]:
FFiresData['RH'].sample(10)

258    59
452    65
148    43
22     44
24     32
331    28
224    57
146    40
324    56
82     51
Name: RH, dtype: int64

Some columns hold float values, like the `DMC` column.

In [None]:
FFiresData['DMC'].head(10)

0     26.2
1     35.4
2     43.7
3     33.3
4     51.3
5     85.3
6     88.9
7    145.4
8    129.5
9     88.0
Name: DMC, dtype: float64

In [None]:
FFiresData['DMC'].value_counts(dropna = False) # This column is stored as float64 because its values have decimal parts.

99.0     10
129.5     9
142.4     8
231.1     8
137.0     7
         ..
4.6       1
24.9      1
133.6     1
96.3      1
3.2       1
Name: DMC, Length: 215, dtype: int64

In [None]:
FFiresData['DMC'].head()

0    26.2
1    35.4
2    43.7
3    33.3
4    51.3
Name: DMC, dtype: float64

### Making new Features

Focused Columns (Slim Down)




In [None]:
FocusedFFD = FFiresData[['DMC', 'DC', 'FFMC', 'ISI']]   

FocusedFFD

Unnamed: 0,DMC,DC,FFMC,ISI
0,26.2,94.3,86.2,5.1
1,35.4,669.1,90.6,6.7
2,43.7,686.9,90.6,6.7
3,33.3,77.5,91.7,9.0
4,51.3,102.2,89.3,9.6
...,...,...,...,...
512,56.7,665.6,81.6,1.9
513,56.7,665.6,81.6,1.9
514,56.7,665.6,81.6,1.9
515,146.0,614.7,94.4,11.3


### Syntax for creating new columns

When making a new column on a dataframe, we have to use the square bracket syntax of accessing a column. We can't use "dot syntax" here.
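As a minimal sketch of why square brackets are needed, the toy dataframe below (hypothetical values, not the forest fires data) tries both syntaxes. Bracket assignment adds a column; dot assignment only sets a Python attribute on the object, and pandas warns about it.

```python
import warnings
import pandas as pd

# Hypothetical toy frame for illustration only
toy = pd.DataFrame({'DMC': [26.2, 146.0, 88.0]})

# Square brackets create a real column
toy['high_dmc'] = toy['DMC'] > 90

# Dot syntax sets a plain attribute instead of a column; pandas emits a
# UserWarning here, which is silenced so the example prints cleanly
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    toy.also_high = toy['DMC'] > 90

print(list(toy.columns))  # 'also_high' is not among the columns
```

Dot syntax still works for reading an existing column whose name is a valid Python identifier, which is why the difference is easy to miss.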

According to malagaweather.com, a DMC of 90 or above indicates an extreme chance of wildfire.

In [None]:
HighChnc =  FocusedFFD['DMC']> 90

HighChnc

0      False
1      False
2      False
3      False
4      False
       ...  
512    False
513    False
514    False
515     True
516    False
Name: DMC, Length: 517, dtype: bool

In [None]:
FocusedFFD['Extreme'] = np.where(FocusedFFD['DMC']> 90, "yes", "no")
FocusedFFD['Extreme'].value_counts()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


yes    334
no     183
Name: Extreme, dtype: int64
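The warning printed above appears because the column subset may be a view of the original dataframe rather than an independent object. One common way to avoid it, sketched here on a hypothetical stand-in dataframe with made-up values, is to call `.copy()` when creating the subset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for FFiresData with made-up values
fires = pd.DataFrame({'DMC': [26.2, 146.0, 88.0, 129.5],
                      'DC': [94.3, 614.7, 300.0, 692.6]})

# .copy() makes the subset an independent dataframe,
# so adding a column to it does not trigger the warning
focused = fires[['DMC', 'DC']].copy()
focused['Extreme'] = np.where(focused['DMC'] > 90, 'yes', 'no')

print(focused['Extreme'].value_counts())
```

The original dataframe is left untouched; only the copy gains the new column.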

Let's count the NaN values in each column to check whether any data is missing.

In [None]:
# Sum null values by column and sort from least to greatest
pd.set_option('display.max_rows', 200)

FFiresData.isnull().sum().sort_values()


X        0
Y        0
month    0
day      0
FFMC     0
DMC      0
DC       0
ISI      0
temp     0
RH       0
wind     0
rain     0
area     0
dtype: int64

None of the columns contain NaN values, so there is nothing to drop for missing data. To practice dropping a column anyway, we'll remove the `day` column.

In [None]:
FFiresData = FFiresData.drop(['day'], axis=1)  # No more day column!

FFiresData.head()

Unnamed: 0,X,Y,month,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,oct,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,7,4,oct,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,8,6,mar,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,8,6,mar,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0


### Column Cleanup

When we're preparing a dataset for a machine learning model, we typically don't want to leave any string values in our dataset, because it's hard to do math on words.

Specifically, we have a column that represents a numeric value but doesn't currently have a numeric datatype. Let's look at the values in a small example dataframe.

In [None]:
Rates = pd.DataFrame([['13%', 3], ['15%', 4]], columns=['int_rate', 'prime_rate'])
Rates

Unnamed: 0,int_rate,prime_rate
0,13%,3
1,15%,4


In [None]:
Rates['int_rate']


0    13%
1    15%
Name: int_rate, dtype: object

In [None]:
# Pull a specific row by its index label

Rates['int_rate'][1]

'15%'

Problems that we need to address with this column:

- String column that should be numeric
- Percent Sign `%` included with the number
- Possible leading or trailing whitespace around the string (not present in these two values, but common in real data)

However, we're not going to try and write exactly the right code to fix this column in one go. We're going to methodically build up to the code that will help us address these problems.


In [None]:
int_rate = '15%'

In [None]:
int_rate.strip("%")

'15'

In [None]:
float('15')

15.0

In [None]:
# "Cast" the string value to a float

type(float(int_rate.strip().strip("%")))

float

### Write a function to make our solution reusable!

In [None]:
# Write a function that can do what we have written above to any
# string that is passed to it.

def int_rate_to_float(cell_contents):
  return float(cell_contents.strip().strip("%"))

In [None]:
# Test out our function by calling it on our example
int_rate_to_float(int_rate)




15.0

In [None]:
# is the data type correct?
type(int_rate_to_float(int_rate))

float

### Apply our solution to every cell in a column

In [None]:
Rates

Unnamed: 0,int_rate,prime_rate
0,13%,3
1,15%,4


In [None]:
Updated_Rates = []

for cell in Rates['int_rate']:
  Updated_Rates.append(int_rate_to_float(cell))

Updated_Rates 

[13.0, 15.0]

In [None]:
#Add new list to be a new column in my dataframe
Rates['int_rate_cleaned'] = pd.Series(Updated_Rates)

Rates


Unnamed: 0,int_rate,prime_rate,int_rate_cleaned
0,13%,3,13.0
1,15%,4,15.0


In [None]:
# What type of data is held in our new column?

# dtypes[-5:] shows the last 5 columns (here there are only 3, so all are shown)
Rates.dtypes[-5:]

int_rate             object
prime_rate            int64
int_rate_cleaned    float64
dtype: object

In [None]:
# The loop above is repetitive; .apply() does the same work in one line
Rates['int_rate_cleaned_from_apply'] = Rates['int_rate'].apply(int_rate_to_float)

Rates

Unnamed: 0,int_rate,prime_rate,int_rate_cleaned,int_rate_cleaned_from_apply
0,13%,3,13.0,13.0
1,15%,4,15.0,15.0
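For simple string cleanup like this, pandas also offers vectorized string methods through the `.str` accessor, which can replace the custom function entirely. The sketch below uses a hypothetical two-row frame (one value given a leading space on purpose) and an illustrative column name.

```python
import pandas as pd

# Hypothetical values; the first has a leading space to show .str.strip()
rates = pd.DataFrame({'int_rate': [' 13%', '15%']})

rates['int_rate_vectorized'] = (
    rates['int_rate']
    .str.strip()        # remove surrounding whitespace
    .str.rstrip('%')    # remove the trailing percent sign
    .astype(float)      # cast the remaining digits to float
)

print(rates)
```

`.apply()` remains the more general tool when the per-cell logic is too involved for the built-in string methods.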


## Challenge

We can create a new column with our cleaned values or overwrite the original, whatever we think best suits our needs. On your assignment you will take the same approach in trying to methodically build up the complexity of your code until you have a few lines that will work for any cell in a column. At that point you'll contain all of that functionality in a reusable function block and then use the `.apply()` function to... well... apply those changes to an entire column.

# [Objective](#pandas-apply) Modify and Create Columns using `.apply()`



## Overview

We've already seen one example of using the `.apply()` function to clean up a column. Let's see if we can do it again, but this time on a slightly more complicated use case.

Remember, the goal here is to write a function that will work correctly on any **individual** cell of a specific column. Then we can reuse that function on those individual cells of a dataframe column via the `.apply()` function.

Let's clean up the `month` column!

## Follow Along

First we'll try and diagnose how bad the problem is and what improvements we might be able to make.

In [None]:
# Count of rows per month; [:20] shows the top 20, but there are only 12 months
FFiresData['month'].value_counts(dropna = False)[:20]

aug    184
sep    172
mar     54
jul     32
feb     20
jun     17
oct     15
dec      9
apr      9
jan      2
may      2
nov      1
Name: month, dtype: int64

In [None]:
# 12 Months
len(FFiresData['month'].value_counts(dropna = False))

12

In [95]:
# How often is the month null?
FFiresData['month'].isnull().sum()

0

There are no missing months here, but what are some possible reasons a value might be missing from a column like this? The cleaning function below handles that case too.

In [105]:
# Create some examples that represent the cases that we want to clean up
#examples = ['owner', "Supervisor", ' Project Manager']

# np.nan is used instead of np.NaN, an alias that was removed in NumPy 2.0
Updates = ['january', 'february', 'march', 'june', 'july', 'august', 'september', 'october', 'november', 'december', np.nan]



In [109]:
# Write a function to clean up these use cases and increase uniformity.
def clean_title(month):
    # Strings are title-cased and stripped of surrounding whitespace
    if isinstance(month, str):
        return month.title().strip()
    # Anything else (such as NaN) is treated as missing
    else:
        return "Unknown"

for title in Updates:
    print(clean_title(title))

January
February
March
June
July
August
September
October
November
December
Unknown

In [106]:
# list comprehensions can combine function calls and for loops over lists
# into one succinct and fairly readable single line of code.
[clean_title(title) for title in Updates]

['January',
 'February',
 'March',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December',
 'Unknown']


In [110]:
# We have a function that works as expected. Let's apply it to our column.
# This time we'll overwrite the original column

FFiresData['month'] = FFiresData['month'].apply(clean_title)

FFiresData['month']

0      Mar
1      Oct
2      Oct
3      Mar
4      Mar
      ... 
512    Aug
513    Aug
514    Aug
515    Aug
516    Nov
Name: month, Length: 517, dtype: object

We can use the same code as we did earlier to see how much progress was made.


In [111]:
# Look at the top 20 months (there are only 12)
FFiresData['month'].value_counts(ascending = False)[:20]


Aug    184
Sep    172
Mar     54
Jul     32
Feb     20
Jun     17
Oct     15
Dec      9
Apr      9
May      2
Jan      2
Nov      1
Name: month, dtype: int64

In [112]:
# How many unique months are there currently?
len(FFiresData['month'].value_counts())


12

In [113]:
# How often is the month null (NaN)?
FFiresData['month'].isnull().sum()

0

## Challenge

Using the `.apply()` function isn't always about creating new columns on a dataframe. We can use it to clean up or modify existing columns as well.

# [Objective](#dates-and-times) Work with Dates and Times with Pandas

## Overview

Pandas has its own datetime datatype that makes it extremely convenient to convert strings in standard date formats to datetime objects and then use those datetime objects either to create new features on a dataframe or to work with the dataset in a timeseries fashion.

This section will demonstrate how to take a column of date strings, convert it to datetime objects and then use the `.dt` accessor to access specific parts of the date (year, month, day) to generate useful columns on a dataframe.

## Follow Along

### Work with Dates 

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

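Many of the most useful date columns in the LendingClub dataset (`LoanStats_2018Q4.csv`) have the suffix `_d` to indicate that they correspond to dates.

We'll use a list comprehension to print them out. The forest fires dataframe has no such columns, so running it on `FFiresData` returns an empty list.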

In [115]:
[col for col in FFiresData if col.endswith('_d')]

[]

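Let's look at the string format of the `issue_d` column in the LendingClub dataframe `df`, which is loaded from `LoanStats_2018Q4.csv` in the last cell of this notebook.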

In [116]:
df['issue_d'][:10]

Because this string format %m-%y is a common datetime format, we can just let Pandas detect this format and translate it to the appropriate datetime object.

In [None]:
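# Let pandas infer the date format. infer_datetime_format is deprecated since
# pandas 2.0, where format inference is the default, so it can be dropped there.
df['issue_d'] = pd.to_datetime(df['issue_d'], infer_datetime_format=True)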

df.dtypes[:15]

Now we can see that the `issue_d` column has been changed to hold `datetime` objects.

Let's look at one of the cells specifically to see what a datetime object looks like:

In [None]:
df['issue_d'].iloc[0]

df['issue_d'][0]

You can see how the month and year have been indicated by the strings that were contained in the column previously, and that the rest of the values have been inferred.
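To make the inference concrete, here is a small sketch on a hypothetical series of month-and-year strings. The sample values and the explicit `%b-%Y` format are assumptions for illustration, since the real column contents are not shown in this notebook. Only the month and year are given, so pandas fills in the first day of the month at midnight.

```python
import pandas as pd

# Hypothetical month-year strings; the real issue_d values may differ
issue_d = pd.Series(['Dec-2018', 'Nov-2018', 'Oct-2018'])

# An explicit format avoids relying on inference
parsed = pd.to_datetime(issue_d, format='%b-%Y')

first = parsed.iloc[0]
print(first)                           # a pandas Timestamp
print(first.year, first.month, first.day)
```

The day component comes out as 1 because nothing in the string specifies it.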

In [None]:
df['issue_d'].dt.year

We can now use the `.dt` accessor to grab specific parts of the datetime object. The cell above grabs just the year from all of the cells in the `issue_d` column.

The month works the same way with `.dt.month`.

It's just that easy! Now, instead of printing them out, let's add these year and month values as new columns on our dataframe. Again, you'll have to scroll all the way over to the right in the table to see the new columns.
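A minimal sketch of that step, using a hypothetical three-row dataframe in place of the LendingClub data. The new column names `issue_year` and `issue_month` are illustrative choices, not names from the original lesson.

```python
import pandas as pd

# Hypothetical stand-in for the loan dataframe
loans = pd.DataFrame({
    'issue_d': pd.to_datetime(['2018-10-01', '2018-11-01', '2018-12-01'])
})

# Pull the year and month out of each datetime and store them as new columns
loans['issue_year'] = loans['issue_d'].dt.year
loans['issue_month'] = loans['issue_d'].dt.month

print(loans)
```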

Because all of these dates come from Q4 of 2018, the `issue_d` column isn't all that interesting. Let's look at the `earliest_cr_line` column, which is also a string but could be converted to datetime format.

We're going to create a new column called `days_from_earliest_credit_to_issue`

It's a long column header, but think about how valuable this piece of information could be. This number will essentially indicate the length of a person's credit history and if that is correlated with repayment or other factors could be a valuable predictor!

What we're about to do is so cool! Pandas' datetime format is so smart that we can simply use the subtraction operator `-` in order to calculate the amount of time between two dates. 

Think about everything that's going on under the hood in order to give us such straightforward syntax! Handling months of different lengths, leap years, etc. Pandas datetime objects are seriously powerful!
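The subtraction can be sketched on a hypothetical two-row dataframe (the dates below are made up). Subtracting one datetime column from another yields timedeltas, and `.dt.days` turns those into whole-day counts.

```python
import pandas as pd

# Hypothetical dates standing in for the loan data
loans = pd.DataFrame({
    'issue_d': pd.to_datetime(['2018-12-01', '2018-12-01']),
    'earliest_cr_line': pd.to_datetime(['2000-02-01', '2018-01-01']),
})

# Datetime minus datetime gives a timedelta column
elapsed = loans['issue_d'] - loans['earliest_cr_line']

# .dt.days extracts the number of whole days from each timedelta
loans['days_from_earliest_credit_to_issue'] = elapsed.dt.days

print(loans)
```

Leap years and month lengths are handled by the datetime arithmetic itself, so no manual calendar logic is needed.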

What's the oldest credit history that was involved in Q4 2018?

25,171 days is ~ 68.96 years of credit history!

## Challenge

Pandas' datetime format is so easy to work with that there's really no excuse for not using dates to make features on a dataframe! Get ready to practice more of this on your assignment.

`NaN` stands for "Not a Number" and is the default missing value indicator in Pandas. In a housing dataset with a `LotFrontage` column, for example, a `NaN` means that no LotFrontage value was recorded for that home.

This is where domain knowledge starts to come in. Think about the context we're working with here: houses. What might a null or blank cell representing "Linear feet of street connected to property" mean in the context of a housing dataset?

Ok, so maybe it makes sense to have some NaNs in this column. What is the datatype of a NaN value?

Perhaps some of this data is truly missing or unrecorded data, but sometimes `NaNs` are more likely to indicate something that was "NA" or "Not Applicable" to a particular observation. There could be multiple reasons why there was no value recorded for a particular feature.

Remember, that Pandas tries to maintain a single datatype for all values in a column, and therefore...

The datatype of a NaN is float!  This means that if we have a column of integer values, but the column has even a single `NaN` that column will not be treated with the integer datatype but all of the integers will be converted to floats in order to try and preserve the same datatype throughout the entire column.
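A quick check of this behavior, using two small made-up series:

```python
import numpy as np
import pandas as pd

# NaN itself is a Python float
print(type(np.nan))

# A column of whole numbers keeps an integer dtype
whole = pd.Series([1, 2, 3])
print(whole.dtype)

# A single NaN forces the whole column to float
with_gap = pd.Series([1, 2, np.nan])
print(with_gap.dtype)
```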

You can already see how understanding column datatypes is crucial to understanding how Pandas helps us manage our data.

In [None]:
# We can fix the header problem by using the 'skiprows' parameter
df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1)

# Alternative with the same effect here: use the second line of the file (index 1) as the header
df = pd.read_csv('LoanStats_2018Q4.csv', header=1)

df