**Why generate features?**
___
- Different types of data
    - continuous: integers or floats
    - categorical: one of a limited set of values
    - ordinal: ranked values
    - boolean: true/false values
    - datetime: dates and times
- course structure
    - chapter 1: feature creation and extraction
    - chapter 2: engineering messy data
    - chapter 3: feature normalization
    - chapter 4: working with text features
___

In [None]:
#Getting to know your data

#Pandas is one of the most popular packages used to work with tabular
#data in Python. It is generally imported using the alias pd and can
#be used to load a CSV (or other delimited files) using read_csv().

#You will be working with a modified subset of the Stack Overflow
#survey response data in the first three chapters of this course.
#This data set records the details and preferences of thousands of
#users of the Stack Overflow website.

# Import pandas
#import pandas as pd

# Import so_survey_csv into so_survey_df
#so_survey_df = pd.read_csv(so_survey_csv)

# Print the first five rows of the DataFrame
#print(so_survey_df.head())

#################################################
#<script.py> output:
#          SurveyDate                                    FormalEducation  ConvertedSalary Hobby       Country  ...     VersionControl Age  Years Experience  Gender   RawSalary
#    0  2/28/18 20:20           Bachelor's degree (BA. BS. B.Eng.. etc.)              NaN   Yes  South Africa  ...                Git  21                13    Male         NaN
#    1  6/28/18 13:26           Bachelor's degree (BA. BS. B.Eng.. etc.)          70841.0   Yes       Sweeden  ...     Git;Subversion  38                 9    Male   70,841.00
#    2    6/6/18 3:37           Bachelor's degree (BA. BS. B.Eng.. etc.)              NaN    No       Sweeden  ...                Git  45                11     NaN         NaN
#    3    5/9/18 1:06  Some college/university study without earning ...          21426.0   Yes       Sweeden  ...  Zip file back-ups  46                12    Male   21,426.00
#    4  4/12/18 22:41           Bachelor's degree (BA. BS. B.Eng.. etc.)          41671.0   Yes            UK  ...                Git  39                 7    Male  £41,671.00
#
#    [5 rows x 11 columns]
#################################################

# Print the data type of each column
#print(so_survey_df.dtypes)

#################################################
#    SurveyDate                     object
#    FormalEducation                object
#    ConvertedSalary               float64
#    Hobby                          object
#    Country                        object
#    StackOverflowJobsRecommend    float64
#    VersionControl                 object
#    Age                             int64
#    Years Experience                int64
#    Gender                         object
#    RawSalary                      object
#    dtype: object
#################################################

In [None]:
#Selecting specific data types
#Often a data set will contain columns with several different data
#types (like the one you are working with). The majority of machine
#learning models require you to have a consistent data type across
#features. Similarly, most feature engineering techniques are
#applicable to only one type of data at a time. For these reasons
#among others, you will often want to be able to access just the
#columns of certain types when working with a DataFrame.

#The DataFrame (so_survey_df) from the previous exercise is available
#in your workspace.

# Create subset of only the numeric columns
#so_numeric_df = so_survey_df.select_dtypes(include=['int', 'float'])

# Print the column names contained in so_numeric_df
#print(so_numeric_df.columns)

#################################################
#<script.py> output:
#    Index(['ConvertedSalary', 'StackOverflowJobsRecommend', 'Age', 'Years Experience'], dtype='object')
#################################################
# In the next lesson, you will learn the most common ways of dealing
#with categorical data.
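As a small illustration of selecting columns by type, the sketch below uses a hypothetical three-row `toy_df` (not the survey data). It assumes the `'number'` selector, which `select_dtypes()` accepts as shorthand for every integer and float dtype, and uses `exclude=` to get the complementary set of columns.

```python
import pandas as pd

# Hypothetical toy data standing in for so_survey_df
toy_df = pd.DataFrame({
    "Country": ["UK", "USA", "Spain"],
    "Age": [21, 38, 45],
    "ConvertedSalary": [41671.0, 70841.0, None],
})

# 'number' matches all integer and float columns
numeric_df = toy_df.select_dtypes(include="number")

# exclude= returns every column that is NOT numeric
non_numeric_df = toy_df.select_dtypes(exclude="number")

print(list(numeric_df.columns))      # ['Age', 'ConvertedSalary']
print(list(non_numeric_df.columns))  # ['Country']
```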

**Dealing with categorical features**
___
- encoding categorical features
    - one-hot encoding
        - converts n categories into n features
        - explainable features
        - problem of collinearity
    - dummy encoding
        - converts n categories into n-1 features
        - avoids collinearity; the omitted category is encoded as all zeros
___

In [None]:
#One-hot encoding and dummy variables

#To use categorical variables in a machine learning model, you first
#need to represent them in a quantitative way. The two most common
#approaches are to one-hot encode the variables or to use dummy
#variables. In this exercise, you will create both types of encoding
#and compare the created column sets. We will continue using the same
#DataFrame from the previous lesson, loaded as so_survey_df, focusing
#on its Country column.

# Convert the Country column to a one-hot encoded DataFrame
#one_hot_encoded = pd.get_dummies(so_survey_df, columns=['Country'], prefix='OH')

# Print the column names
#print(one_hot_encoded.columns)

#################################################
#<script.py> output:
#    Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby', 'StackOverflowJobsRecommend', 'VersionControl', 'Age', 'Years Experience', 'Gender', 'RawSalary', 'OH_France', 'OH_India',
#           'OH_Ireland', 'OH_Russia', 'OH_South Africa', 'OH_Spain', 'OH_Sweeden', 'OH_UK', 'OH_USA', 'OH_Ukraine'],
#          dtype='object')
#################################################

# Create dummy variables for the Country column
#dummy = pd.get_dummies(so_survey_df, columns=['Country'], drop_first=True, prefix='DM')

# Print the column names
#print(dummy.columns)

#################################################
#<script.py> output:
#    Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby', 'StackOverflowJobsRecommend', 'VersionControl', 'Age', 'Years Experience', 'Gender', 'RawSalary', 'DM_India', 'DM_Ireland',
#           'DM_Russia', 'DM_South Africa', 'DM_Spain', 'DM_Sweeden', 'DM_UK', 'DM_USA', 'DM_Ukraine'],
#          dtype='object')
#################################################
#Did you notice that the column for France was missing when you
#created dummy variables? Now you can choose to use one-hot encoding
#or dummy variables where appropriate.
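To make the difference between n and n-1 columns concrete, here is a minimal sketch on a hypothetical four-row `toy_df` with three countries. The column prefixes mirror the exercise, but the data is invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: three distinct countries
toy_df = pd.DataFrame({"Country": ["France", "India", "UK", "India"]})

# One-hot encoding: one column per category (n = 3)
one_hot = pd.get_dummies(toy_df, columns=["Country"], prefix="OH")

# Dummy encoding: the first category (alphabetically, France) is dropped (n - 1 = 2)
dummy = pd.get_dummies(toy_df, columns=["Country"], prefix="DM", drop_first=True)

print(list(one_hot.columns))  # ['OH_France', 'OH_India', 'OH_UK']
print(list(dummy.columns))    # ['DM_India', 'DM_UK']
```

In the dummy version, a row for France is identified by zeros in every `DM_` column.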

In [None]:
#Dealing with uncommon categories

#Some features can have many different categories but a very uneven
#distribution of their occurrences. Take, for example, data
#scientists' favorite languages to code in: common choices are
#Python, R, and Julia, but some individuals make bespoke choices such
#as FORTRAN or C. In these cases, you may not want to create a feature
#for each value, but only for the more common occurrences.

# Create a series out of the Country column
#countries = so_survey_df['Country']

# Get the counts of each category
#country_counts = countries.value_counts()

# Print the count values for each category
#print(country_counts)

#################################################
#<script.py> output:
#    South Africa    166
#    USA             164
#    Spain           134
#    Sweeden         119
#    France          115
#    Russia           97
#    India            95
#    UK               95
#    Ukraine           9
#    Ireland           5
#    Name: Country, dtype: int64
#################################################

# Create a mask for only categories that occur fewer than 10 times
#mask = countries.isin(country_counts[country_counts < 10].index)

# Print the top 5 rows in the mask series
#print(mask.head())

#################################################
#<script.py> output:
#    0    False
#    1    False
#    2    False
#    3    False
#    4    False
#    Name: Country, dtype: bool
#################################################

# Label all other categories as Other
#countries[mask] = 'Other'

# Print the updated category counts
#print(countries.value_counts())

#################################################
#<script.py> output:
#    South Africa    166
#    USA             164
#    Spain           134
#    Sweeden         119
#    France          115
#    Russia           97
#    India            95
#    UK               95
#    Other            14
#    Name: Country, dtype: int64
#################################################
#now you can work with large data sets while grouping low frequency
#categories.
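The grouping step can also be written without assigning into the original Series. The sketch below is one possible variant, assuming a hypothetical `countries` Series and a threshold of 2 rather than 10; it uses `Series.where()`, which keeps values where the condition holds and substitutes `'Other'` elsewhere.

```python
import pandas as pd

# Hypothetical toy Series: two frequent and two rare categories
countries = pd.Series(["USA"] * 4 + ["UK"] * 3 + ["Ukraine", "Ireland"], name="Country")

# Categories that occur fewer than 2 times
counts = countries.value_counts()
rare = counts[counts < 2].index

# Keep common categories, relabel rare ones; the original Series is left untouched
grouped = countries.where(~countries.isin(rare), "Other")

print(grouped.value_counts().to_dict())  # {'USA': 4, 'UK': 3, 'Other': 2}
```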

**Numeric variables**
___
- Types of numeric features
    - age
    - price
    - counts
    - geospatial data
___

In [None]:
#Binarizing columns

#While numeric values can often be used without any feature
#engineering, there will be cases when some form of manipulation
#can be useful. For example on some occasions, you might not care
#about the magnitude of a value but only care about its direction,
#or if it exists at all. In these situations, you will want to
#binarize a column. In the so_survey_df data, you have a large
#number of survey respondents that are working voluntarily (without
#pay). You will create a new column titled Paid_Job indicating
#whether each person is paid (their salary is greater than zero).

# Create the Paid_Job column filled with zeros
#so_survey_df['Paid_Job'] = 0

# Replace all the Paid_Job values where ConvertedSalary is > 0
#so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1

# Print the first five rows of the columns
#print(so_survey_df[['Paid_Job', 'ConvertedSalary']].head())

#################################################
#<script.py> output:
#       Paid_Job  ConvertedSalary
#    0         0              0.0
#    1         1          70841.0
#    2         0              0.0
#    3         1          21426.0
#    4         1          41671.0
#################################################
#binarizing columns can also be useful for your target variables.
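The same binarization can be expressed as a single vectorized comparison. This is a sketch on a hypothetical `salaries` frame; note that comparisons against NaN evaluate to False, so missing salaries end up as 0 here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including one missing salary
salaries = pd.DataFrame({"ConvertedSalary": [0.0, 70841.0, np.nan, 21426.0]})

# True where the salary is positive, then cast the booleans to 0/1
salaries["Paid_Job"] = (salaries["ConvertedSalary"] > 0).astype(int)

print(salaries["Paid_Job"].tolist())  # [0, 1, 0, 1]
```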

In [None]:
#Binning values

#For many continuous values you will care less about the exact value
#of a numeric column, but instead care about the bucket it falls into.
#This can be useful when plotting values, or simplifying your machine
#learning models. It is mostly used on continuous variables where
#accuracy is not the biggest concern e.g. age, height, wages.

#Bins are created using pd.cut(df['column_name'], bins) where bins can
#be an integer specifying the number of evenly spaced bins, or a list
#of bin boundaries.

# Bin the continuous variable ConvertedSalary into 5 bins
#so_survey_df['equal_binned'] = pd.cut(so_survey_df['ConvertedSalary'], 5)

# Print the first 5 rows of the equal_binned column
#print(so_survey_df[['equal_binned', 'ConvertedSalary']].head())

#################################################
#<script.py> output:
#              equal_binned  ConvertedSalary
#    0  (-2000.0, 400000.0]              0.0
#    1  (-2000.0, 400000.0]          70841.0
#    2  (-2000.0, 400000.0]              0.0
#    3  (-2000.0, 400000.0]          21426.0
#    4  (-2000.0, 400000.0]          41671.0
#################################################

# Import numpy
#import numpy as np

# Specify the boundaries of the bins
#bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]

# Bin labels
#labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# Bin the continuous variable ConvertedSalary using these boundaries
#so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'],
#                                         bins, labels = labels)

# Print the first 5 rows of the boundary_binned column
#print(so_survey_df[['boundary_binned', 'ConvertedSalary']].head())

#################################################
#<script.py> output:
#      boundary_binned  ConvertedSalary
#    0        Very low              0.0
#    1          Medium          70841.0
#    2        Very low              0.0
#    3             Low          21426.0
#    4             Low          41671.0
#################################################
#now you can bin columns with equal spacing and predefined boundaries.
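One detail worth checking when you pass explicit boundaries is which bin a boundary value lands in. The sketch below reuses the bins and labels from the exercise on a hypothetical `salary` Series; by default `pd.cut()` builds right-inclusive intervals, so 10000 falls into 'Very low' while 10001 falls into 'Low'.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries, chosen to sit on and around a bin boundary
salary = pd.Series([0, 10000, 10001, 70841, 200000])

bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# Intervals are (left, right] by default
binned = pd.cut(salary, bins, labels=labels)

print(binned.tolist())
# ['Very low', 'Very low', 'Low', 'Medium', 'Very high']
```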

**Why do missing values exist?**
___
- How gaps in data occur
    - data not being collected properly
    - collection and management errors
    - data intentionally being omitted
    - could be created due to transformations of the data
- Why do we care?
    - some models cannot work with missing data (Nulls/NaNs)
    - missing data may be a sign of a wider data issue
    - missing data can be a useful feature
- df.info()
- df.isnull()
- df.isnull().sum()
- df.notnull()
___

In [None]:
#How sparse is my data?
#Most data sets contain missing values, often represented as NaN
#(Not a Number). If you are working with Pandas you can easily check
#how many missing values exist in each column.

#Let's find out how many of the developers taking the survey chose to
#enter their age (found in the Age column of so_survey_df) and their
#gender (Gender column of so_survey_df).

# Subset the DataFrame
#sub_df = so_survey_df[['Age', 'Gender']]

# Print the number of non-missing values
#print(sub_df.info())

#################################################
#<script.py> output:
#    <class 'pandas.core.frame.DataFrame'>
#    RangeIndex: 999 entries, 0 to 998
#    Data columns (total 2 columns):
#    Age       999 non-null int64
#    Gender    693 non-null object
#    dtypes: int64(1), object(1)
#    memory usage: 15.7+ KB
#    None
#################################################
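Besides `info()`, the count and share of missing values per column can be computed directly. A minimal sketch, assuming a hypothetical four-row `sub_df` in place of the survey subset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy subset with two missing Gender entries
sub_df = pd.DataFrame({
    "Age": [21, 38, 45, 46],
    "Gender": ["Male", np.nan, "Male", np.nan],
})

# Number of missing values per column
missing_counts = sub_df.isnull().sum()

# Fraction of missing values per column (mean of a boolean mask)
missing_share = sub_df.isnull().mean()

print(missing_counts.to_dict())  # {'Age': 0, 'Gender': 2}
print(missing_share.to_dict())   # {'Age': 0.0, 'Gender': 0.5}
```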

In [None]:
#Finding the missing values

#While having a summary of how much of your data is missing can be
#useful, often you will need to find the exact locations of these
#missing values. Using the same subset of the StackOverflow data
#from the last exercise (sub_df), you will show how a value can be
#flagged as missing.

# Print the top 10 entries of the DataFrame
#print(sub_df.head(10))

#################################################
#<script.py> output:
#       Age  Gender
#    0   21    Male
#    1   38    Male
#    2   45     NaN
#    3   46    Male
#    4   39    Male
#    5   39    Male
#    6   34    Male
#    7   24  Female
#    8   23    Male
#    9   36     NaN
#################################################

# Print the locations of the missing values
#print(sub_df.head(10).isnull())

#################################################
#<script.py> output:
#         Age  Gender
#    0  False   False
#    1  False   False
#    2  False    True
#    3  False   False
#    4  False   False
#    5  False   False
#    6  False   False
#    7  False   False
#    8  False   False
#    9  False    True
#################################################

# Print the locations of the non-missing values
#print(sub_df.head(10).notnull())

#################################################
#<script.py> output:
#        Age  Gender
#    0  True    True
#    1  True    True
#    2  True   False
#    3  True    True
#    4  True    True
#    5  True    True
#    6  True    True
#    7  True    True
#    8  True    True
#    9  True   False
#################################################
# finding where the missing values exist can often be important.
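Once you have the boolean mask, you can use it to pull out the affected rows themselves. A short sketch on a hypothetical `sub_df`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy subset with one missing Gender entry
sub_df = pd.DataFrame({
    "Age": [21, 38, 45, 46],
    "Gender": ["Male", "Male", np.nan, "Female"],
})

# Boolean indexing keeps only the rows where Gender is missing
missing_gender = sub_df[sub_df["Gender"].isnull()]

# Row labels of the missing entries
missing_index = missing_gender.index.tolist()

print(missing_index)  # [2]
```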

**Dealing with missing values (I)**
___
- df.dropna()
- df.drop()
- random omissions
    - complete case analysis / listwise deletion
    - drawbacks
        - deletes valid data points as well
        - relies on randomness
        - reduces information if a feature is removed (degrees of freedom)
- replacement
- recording missing values
___

In [None]:
#Listwise deletion

#The simplest way to deal with missing values in your dataset when
#they are occurring entirely at random is to remove those rows, also
#called 'listwise deletion'.

#Depending on the use case, you will sometimes want to remove all
#missing values in your data while other times you may want to only
#remove a particular column if too many values are missing in that
#column.

# Print the number of rows and columns
#print(so_survey_df.shape)

#################################################
#<script.py> output:
#    (999, 11)
#################################################

# Create a new DataFrame dropping all incomplete rows
#no_missing_values_rows = so_survey_df.dropna()

# Print the shape of the new DataFrame
#print(no_missing_values_rows.shape)

#################################################
#<script.py> output:
#    (264, 11)
#################################################

# Create a new DataFrame dropping all columns that contain missing values
#no_missing_values_cols = so_survey_df.dropna(how='any', axis=1)

# Print the shape of the new DataFrame
#print(no_missing_values_cols.shape)

#################################################
#<script.py> output:
#    (999, 7)
#################################################

# Drop all rows where Gender is missing
#no_gender = so_survey_df.dropna(subset=['Gender'])

# Print the shape of the new DataFrame
#print(no_gender.shape)

#################################################
#<script.py> output:
#    (693, 11)
#################################################
#as you can see dropping all rows that contain any missing values
#may greatly reduce the size of your dataset. So you need to think
#carefully and consider several trade-offs when deleting missing
#values.
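The trade-offs are easier to see side by side. The sketch below applies several `dropna()` variants to a hypothetical four-row `df`; the `thresh` argument (keep rows with at least that many non-missing values) is not used in the exercise and is shown here only as a possible middle ground.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two columns
df = pd.DataFrame({
    "Age": [21, 38, 45, 46],
    "Gender": ["Male", "Male", np.nan, "Female"],
    "RawSalary": [np.nan, "70,841.00", np.nan, "21,426.00"],
})

complete_rows = df.dropna()                  # rows with no missing values
complete_cols = df.dropna(axis=1)            # columns with no missing values
has_gender = df.dropna(subset=["Gender"])    # only require Gender to be present
mostly_full = df.dropna(thresh=2)            # at least 2 non-missing values per row

print(complete_rows.shape)  # (2, 3)
print(complete_cols.shape)  # (4, 1)
print(has_gender.shape)     # (3, 3)
print(mostly_full.shape)    # (3, 3)
```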

In [None]:
#Replacing missing values with constants

#While removing missing data entirely may be a correct approach in
#many situations, this may result in a lot of information being
#omitted from your models.

#You may find categorical columns where the missing value is a valid
#piece of information in itself, such as someone refusing to answer
#a question in a survey. In these cases, you can fill all missing
#values with a new category entirely, for example 'No response given'.

# Print the count of occurrences
#print(so_survey_df['Gender'].value_counts())

#################################################
#<script.py> output:
#    Male                                                                         632
#    Female                                                                        53
#    Female;Male                                                                    2
#    Transgender                                                                    2
#    Female;Male;Transgender;Non-binary. genderqueer. or gender non-conforming      1
#    Female;Transgender                                                             1
#    Male;Non-binary. genderqueer. or gender non-conforming                         1
#    Non-binary. genderqueer. or gender non-conforming                              1
#    Name: Gender, dtype: int64
#################################################

# Replace missing values
#so_survey_df['Gender'] = so_survey_df['Gender'].fillna(value='Not Given')

# Print the count of each value
#print(so_survey_df['Gender'].value_counts())

#################################################
#<script.py> output:
#    Male                                                                         632
#    Not Given                                                                    306
#    Female                                                                        53
#    Female;Male                                                                    2
#    Transgender                                                                    2
#    Female;Male;Transgender;Non-binary. genderqueer. or gender non-conforming      1
#    Female;Transgender                                                             1
#    Male;Non-binary. genderqueer. or gender non-conforming                         1
#    Non-binary. genderqueer. or gender non-conforming                              1
#    Name: Gender, dtype: int64
#################################################
#By filling in these missing values you can use the columns in your
#analyses.
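If the fact that a value was missing may itself be informative, you can record it before filling. A minimal sketch, assuming a hypothetical `df` and a hypothetical flag column named `Gender_missing`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing responses
df = pd.DataFrame({"Gender": ["Male", np.nan, "Female", np.nan]})

# Record which rows were missing BEFORE overwriting them
df["Gender_missing"] = df["Gender"].isnull()

# Fill the gaps with an explicit category (assigning back instead of inplace=True)
df["Gender"] = df["Gender"].fillna("Not Given")

print(df["Gender_missing"].tolist())  # [False, True, False, True]
print(df["Gender"].tolist())          # ['Male', 'Not Given', 'Female', 'Not Given']
```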

**Dealing with missing values (II)**
___
- If you cannot drop rows, what else can you do?
    - **Categorical columns**: replace missing values with the most commonly occurring value or with a string that flags missing values, such as 'None'
    - **Numeric columns**: replace missing values with a suitable value
        - measure of central tendency, e.g., mean, median
- compute imputation values (e.g. the mean) on the train set only, then apply them to both the train and test sets.
___
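The last point can be sketched as follows. The `train` and `test` frames and the `score` column are hypothetical; the key idea is that the fill value comes from the train set alone and is then reused for the test set.

```python
import numpy as np
import pandas as pd

# Hypothetical train and test splits, each with a missing value
train = pd.DataFrame({"score": [7.0, np.nan, 8.0, 9.0]})
test = pd.DataFrame({"score": [np.nan, 6.0]})

# Compute the fill value from the train set only
train_mean = train["score"].mean()

# Apply the same value to both sets
train_filled = train["score"].fillna(train_mean)
test_filled = test["score"].fillna(train_mean)

print(train_mean)            # 8.0
print(test_filled.tolist())  # [8.0, 6.0]
```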

In [None]:
#Filling continuous missing values

#In the last lesson, you dealt with different methods of removing
#missing values and filling in missing values with a fixed
#string. These approaches are valid in many cases, particularly when
#dealing with categorical columns, but have limited use when working
#with continuous values. In these cases, it may be most valid to fill
#the missing values in the column with a value calculated from the
#entries present in the column.

# Print the first five rows of StackOverflowJobsRecommend column
#print(so_survey_df['StackOverflowJobsRecommend'].head())

#################################################
#<script.py> output:
#    0    NaN
#    1    7.0
#    2    8.0
#    3    NaN
#    4    8.0
#    Name: StackOverflowJobsRecommend, dtype: float64
#################################################

# Fill missing values with the mean
#so_survey_df['StackOverflowJobsRecommend'] = so_survey_df['StackOverflowJobsRecommend'].fillna(so_survey_df['StackOverflowJobsRecommend'].mean())

# Print the first five rows of StackOverflowJobsRecommend column
#print(so_survey_df['StackOverflowJobsRecommend'].head())

#################################################
#<script.py> output:
#    0    7.061602
#    1    7.000000
#    2    8.000000
#    3    7.061602
#    4    8.000000
#    Name: StackOverflowJobsRecommend, dtype: float64
#################################################

# Round the StackOverflowJobsRecommend values
#so_survey_df['StackOverflowJobsRecommend'] = np.round(so_survey_df['StackOverflowJobsRecommend'])

# Print the top 5 rows
#print(so_survey_df['StackOverflowJobsRecommend'].head())

#################################################
#<script.py> output:
#    0    7.0
#    1    7.0
#    2    8.0
#    3    7.0
#    4    8.0
#    Name: StackOverflowJobsRecommend, dtype: float64
#################################################
#remember you should only round your values if you are certain it is applicable.

**Dealing with other data issues**
___

In [None]:
#Dealing with stray characters (I)

#In this exercise, you will work with the RawSalary column of
#so_survey_df which contains the wages of the respondents along with
#the currency symbols and commas, such as $42,000. When importing
#data from Microsoft Excel, more often than not you will come across
#data in this form.

# Remove the commas in the column
#so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace(',', '', regex=False)

# Remove the dollar signs in the column ('$' is a regex anchor, so match it literally)
#so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace('$', '', regex=False)

#################################################
#Replacing/removing specific characters is a very useful skill.
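When several stray characters need to go, one regular expression can remove them in a single pass. This is a sketch on a hypothetical `raw` Series; inside a character class such as `[$£,]` the dollar sign is treated as a literal character, and `pd.to_numeric()` is used here for the final conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical raw salary strings with commas and currency symbols
raw = pd.Series(["70,841.00", "$21,426.00", "£41,671.00", np.nan])

# Remove every '$', '£' and ',' in one pass; missing values stay missing
cleaned = raw.str.replace(r"[$£,]", "", regex=True)

# Convert the cleaned strings to floats
salary = pd.to_numeric(cleaned)

print(salary.tolist())  # [70841.0, 21426.0, 41671.0, nan]
```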

In [None]:
#Dealing with stray characters (II)

#In the last exercise, you could tell quickly from the
#df.head() call which characters were causing an issue. In many cases
#this will not be so apparent. There will often be values deep within
#a column that are preventing you from casting a column as a numeric
#type so that it can be used in a model or further feature engineering.

#One approach to finding these values is to force the column to the
#data type desired using pd.to_numeric(), coercing any values causing
#issues to NaN, then filtering the DataFrame to just the rows containing
#the NaN values.

#Try to cast the RawSalary column as a float and it will fail as an
#additional character can now be found in it. Find the character and
#remove it so the column can be cast as a float.

# Attempt to convert the column to numeric values
#numeric_vals = pd.to_numeric(so_survey_df['RawSalary'], errors='coerce')

# Find the indexes of missing values
#idx = numeric_vals.isna()

# Print the relevant rows
#print(so_survey_df['RawSalary'][idx])

#################################################
#<script.py> output:
#    0             NaN
#    2             NaN
#    4       £41671.00
#    6             NaN
#    8             NaN
#    ...
#    49      £19500.00
#    50            NaN
#    52            NaN
#    53      £36000.00
#    54            NaN
#    Name: RawSalary, Length: 401, dtype: object
#################################################

# Replace the offending characters
#so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace('£', '', regex=False)

# Convert the column to float
#so_survey_df['RawSalary'] = so_survey_df['RawSalary'].astype('float')

# Print the column
#print(so_survey_df['RawSalary'])

#################################################
#<script.py> output:
#    0            NaN
#    1        70841.0
#    2            NaN
#    3        21426.0
#    4        41671.0
#    ...
#    994          NaN
#    995      58746.0
#    996      55000.0
#    997          NaN
#    998    1000000.0
#    Name: RawSalary, Length: 999, dtype: float64
#################################################
#Remember that even after removing all the relevant characters, you
#still need to change the type of the column to numeric if you want
#to plot these continuous values.
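In the output above, rows that were already missing are mixed in with the rows that failed to convert. One way to isolate only the offending values, sketched here on a hypothetical `raw` Series, is to keep entries that are NaN after coercion but were not NaN before it.

```python
import numpy as np
import pandas as pd

# Hypothetical column: one genuinely missing value, one stray character
raw = pd.Series(["70841.00", np.nan, "£41671.00", "21426.00"])

# Values that cannot be parsed become NaN
numeric = pd.to_numeric(raw, errors="coerce")

# Failed to convert AND was not missing to begin with
offending = raw[numeric.isna() & raw.notna()]

print(offending.tolist())  # ['£41671.00']
```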

In [None]:
#Method chaining

#When applying multiple operations on the same column (like in the
#previous exercises), you made the changes in several steps, assigning
#the results back in each step. However, when applying multiple
#successive operations on the same column, you can "chain" these
#operations together for clarity and ease of management. This can be
#achieved by calling multiple methods sequentially:

# Method chaining
#df['column'] = df['column'].method1().method2().method3()

# Same as
#df['column'] = df['column'].method1()
#df['column'] = df['column'].method2()
#df['column'] = df['column'].method3()

#In this exercise you will repeat the steps you performed in the last
#two exercises, but do so using method chaining.

# Use method chaining
#so_survey_df['RawSalary'] = so_survey_df['RawSalary']\
#                              .str.replace(',', '', regex=False)\
#                              .str.replace('$', '', regex=False)\
#                              .str.replace('£', '', regex=False)\
#                              .astype('float')

# Print the RawSalary column
#print(so_survey_df['RawSalary'])

#################################################
#<script.py> output:
#    0            NaN
#    1        70841.0
#    2            NaN
#    3        21426.0
#    4        41671.0
#    ...
#    994          NaN
#    995      58746.0
#    996      55000.0
#    997          NaN
#    998    1000000.0
#    Name: RawSalary, Length: 999, dtype: float64
#################################################
#Custom functions can also be used when method chaining using the
#.apply() method.
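As a rough illustration of chaining a custom function with `.apply()`, the sketch below defines a hypothetical helper, `strip_currency()`, and applies it element by element. The helper name and the toy `raw` Series are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def strip_currency(value):
    """Hypothetical helper: remove '$', '£' and ',' from a string.

    Non-string values (such as NaN) are returned unchanged.
    """
    if isinstance(value, str):
        return value.replace("$", "").replace("£", "").replace(",", "")
    return value

# Hypothetical raw salary strings
raw = pd.Series(["$70,841.00", "£41,671.00", np.nan])

# Apply the custom function, then convert the result to numbers
salary = pd.to_numeric(raw.apply(strip_currency))

print(salary.tolist())  # [70841.0, 41671.0, nan]
```

The built-in `.str` methods are usually faster than `.apply()` on large columns, so a custom function is mainly useful when the cleaning logic does not fit a simple replacement.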

**Data distributions**
___
- most models assume your data is normally distributed and/or on the same scale
    - 1 sd = 68.27%; 2 sd = 95.45%; 3 sd = 99.73%
- decision tree-based models do not make this assumption
    - As decision trees split on a single threshold per feature, they do not require all the columns to be on the same scale.
- delving deeper with box plots
    - Interquartile Range (IQR) = Q3 - Q1, the spread between the 25th (Q1) and 75th (Q3) percentiles
    - Minimum = Q1 - 1.5 * IQR
    - Maximum = Q3 + 1.5 * IQR
    - outliers are points below Minimum or above Maximum
- pairing distributions with seaborn library
___
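The box plot limits listed above can be computed by hand. This is a minimal sketch on a hypothetical `values` Series, using the default linear interpolation of `Series.quantile()`:

```python
import pandas as pd

# Hypothetical data with one obvious extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = values.quantile(0.25)  # 3.25
q3 = values.quantile(0.75)  # 7.75
iqr = q3 - q1               # 4.5

# "Minimum" and "Maximum" in the box plot sense
lower = q1 - 1.5 * iqr      # -3.5
upper = q3 + 1.5 * iqr      # 14.5

# Points outside the limits are flagged as outliers
outliers = values[(values < lower) | (values > upper)]

print(outliers.tolist())  # [100]
```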

In [None]:
#What does your data look like? (I)

#Up until now you have focused on creating new features and dealing
#with issues in your data. Feature engineering can also be used to
#make the most out of the data that you already have and use it more
#effectively when creating machine learning models.

#Many algorithms may assume that your data is normally distributed,
#or at least that all your columns are on the same scale. This will
#often not be the case, e.g. one feature may be measured in thousands
#of dollars while another would be number of years. In this exercise,
#you will create plots to examine the distributions of some numeric
#columns in the so_survey_df DataFrame, stored in so_numeric_df.

# Import matplotlib for plotting
#import matplotlib.pyplot as plt

# Create a histogram
#so_numeric_df.hist()
#plt.show()

![_images/16.1.svg](_images/16.1.svg)

In [None]:
# Create a boxplot of two columns
#so_numeric_df[['Age', 'Years Experience']].boxplot()
#plt.show()

![_images/16.2.svg](_images/16.2.svg)

In [None]:
# Create a boxplot of ConvertedSalary
#so_numeric_df[['ConvertedSalary']].boxplot()
#plt.show()

![_images/16.3.svg](_images/16.3.svg)
As you can see, the distributions of columns in a dataset can vary quite a bit.

In [None]:
#What does your data look like? (II)

#In the previous exercise you looked at the distribution of individual
#columns. While this is a good start, a more detailed view of how
#different features interact with each other may be useful as this
#can impact your decision on what to transform and how.

# Import packages
#import matplotlib.pyplot as plt
#import seaborn as sns

# Plot pairwise relationships
#sns.pairplot(so_numeric_df)

# Show plot
#plt.show()

![_images/16.4.svg](_images/16.4.svg)

In [None]:
# Print summary statistics
#print(so_numeric_df.describe())

#################################################
#<script.py> output:
#           ConvertedSalary         Age  Years Experience
#    count     9.990000e+02  999.000000        999.000000
#    mean      6.161746e+04   36.003003          9.961962
#    std       1.760924e+05   13.255127          4.878129
#    min       0.000000e+00   18.000000          0.000000
#    25%       0.000000e+00   25.000000          7.000000
#    50%       2.712000e+04   35.000000         10.000000
#    75%       7.000000e+04   45.000000         13.000000
#    max       2.000000e+06   83.000000         27.000000
#################################################
#understanding these summary statistics of a column can be very
#valuable when deciding what transformations are necessary.

**Scaling and transformations**
___
- Min-Max scaling / Normalization
    - distribution remains the same
    - values change to range 0-1
    - MinMaxScaler() from scikit-learn preprocessing module
- Standardization
    - centers the distribution on a mean of zero and rescales it to unit standard deviation
    - StandardScaler() from scikit-learn preprocessing module
- Log Transformation
    - can make highly skewed distributions less skewed
    - PowerTransformer() from scikit-learn preprocessing module (a power transform that defaults to the Yeo-Johnson method rather than a plain logarithm)
___

In [None]:
#Normalization

#As discussed in the video, in normalization you linearly scale the
#entire column between 0 and 1, with 0 corresponding with the lowest
#value in the column, and 1 with the largest.

#When using scikit-learn (the most commonly used machine learning
#library in Python) you can use a MinMaxScaler to apply normalization.
#(It is called this as it scales your values between a minimum and
#maximum value.)

# Import MinMaxScaler
#from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler
#MM_scaler = MinMaxScaler()

# Fit MM_scaler to the data
#MM_scaler.fit(so_numeric_df[['Age']])

# Transform the data using the fitted scaler
#so_numeric_df['Age_MM'] = MM_scaler.transform(so_numeric_df[['Age']])

# Compare the original and transformed columns
#print(so_numeric_df[['Age_MM', 'Age']].head())

#################################################
#<script.py> output:
#         Age_MM  Age
#    0  0.046154   21
#    1  0.307692   38
#    2  0.415385   45
#    3  0.430769   46
#    4  0.323077   39
#################################################
#Did you notice that all values have been scaled between 0 and 1?
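Under the hood, min-max scaling to the range 0 to 1 is the formula (x - min) / (max - min). The sketch below applies it by hand to a hypothetical `age` Series whose minimum (18) and maximum (83) match the summary statistics of the Age column, so the values for ages 21 and 38 agree with the output above.

```python
import pandas as pd

# Hypothetical ages spanning the same min (18) and max (83) as the survey column
age = pd.Series([18, 21, 38, 83], name="Age")

# Min-max scaling: the smallest value maps to 0 and the largest to 1
age_mm = (age - age.min()) / (age.max() - age.min())

print(age_mm.round(6).tolist())  # [0.0, 0.046154, 0.307692, 1.0]
```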

In [None]:
#Standardization

#While normalization can be useful for scaling a column between two
#fixed values, it is hard to compare two scaled columns if even one of
#them is overly affected by outliers. One commonly used solution to
#this is called standardization, where instead of having a strict
#upper and lower bound, you center the data around its mean, and
#calculate the number of standard deviations away from the mean each
#data point is.

# Import StandardScaler
#from sklearn.preprocessing import StandardScaler

# Instantiate StandardScaler
#SS_scaler = StandardScaler()

# Fit SS_scaler to the data
#SS_scaler.fit(so_numeric_df[['Age']])

# Transform the data using the fitted scaler
#so_numeric_df['Age_SS'] = SS_scaler.transform(so_numeric_df[['Age']])

# Compare the original and transformed columns
#print(so_numeric_df[['Age_SS', 'Age']].head())

#################################################
#       Age_SS  Age
#    0 -1.132431   21
#    1  0.150734   38
#    2  0.679096   45
#    3  0.754576   46
#    4  0.226214   39
#################################################
#You can see that the values have been scaled linearly, but not
#between set values.

In [None]:
#Log transformation

#In the previous exercises you scaled the data linearly, which will
#not affect the data's shape. This works great if your data is
#normally distributed (or closely normally distributed), an assumption
#that a lot of machine learning models make. Sometimes you will work
#with data that closely conforms to normality, e.g. the height or
#weight of a population. On the other hand, many variables in the
#real world do not follow this pattern, e.g. wages or age of a
#population. In this exercise you will use a log transform on the
#ConvertedSalary column in the so_numeric_df DataFrame as it has a
#large amount of its data centered around the lower values, but
#contains very high values also. These distributions are said to have
#a long right tail.

# Import PowerTransformer
#from sklearn.preprocessing import PowerTransformer

# Instantiate PowerTransformer
#pow_trans = PowerTransformer()

# Train the transform on the data
#pow_trans.fit(so_numeric_df[['ConvertedSalary']])

# Apply the power transform to the data
#so_numeric_df['ConvertedSalary_LG'] = pow_trans.transform(so_numeric_df[['ConvertedSalary']])

# Plot the data before and after the transformation
#so_numeric_df[['ConvertedSalary', 'ConvertedSalary_LG']].hist()
#plt.show()

![_images/16.5.svg](_images/16.5.svg)
Did you notice the change in the shape of the distribution?
ConvertedSalary_LG column looks much more normal than the original
ConvertedSalary column.
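
To put a number on "looks much more normal", you can compare skewness before and after. A minimal sketch, assuming synthetic log-normally distributed values in a made-up `salary_k` column (not the survey data), and comparing a plain `np.log1p` with `PowerTransformer`, whose default method is Yeo-Johnson:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed data standing in for a salary column
rng = np.random.default_rng(0)
df = pd.DataFrame({'salary_k': rng.lognormal(mean=3.0, sigma=1.0, size=1000)})

# Plain log transform (log1p also handles zeros safely)
df['salary_log'] = np.log1p(df['salary_k'])

# Power transform: method='yeo-johnson' and standardize=True by default
pow_trans = PowerTransformer()
df['salary_pt'] = pow_trans.fit_transform(df[['salary_k']]).ravel()

# Skewness close to 0 indicates a roughly symmetric distribution
print(df.skew())
```

On data like this both transforms should pull the skewness much closer to zero than the raw column.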

**Removing outliers**
___
- Quantile based detection
- Standard deviation based detection
___

In [None]:
#Percentage based outlier removal

#One way to ensure a small portion of data is not having an overly
#adverse effect is by removing a certain percentage of the largest
#and/or smallest values in the column. This can be achieved by
#finding the relevant quantile and trimming the data using it with
#a mask. This approach is particularly useful if you are concerned
#that the highest values in your dataset should be avoided. When
#using this approach, you must remember that even if there are no
#outliers, this will still remove the same top N percentage from the
#dataset.

# Find the 95th percentile (the 0.95 quantile)
#quantile = so_numeric_df['ConvertedSalary'].quantile(0.95)

# Trim the outliers
#trimmed_df = so_numeric_df[so_numeric_df['ConvertedSalary'] < quantile]

# The original histogram
#so_numeric_df[['ConvertedSalary']].hist()
#plt.show()
#plt.clf()

# The trimmed histogram
#trimmed_df[['ConvertedSalary']].hist()
#plt.show()

![_images/16.6.svg](_images/16.6.svg)
![_images/16.7.svg](_images/16.7.svg)
In the next exercise, you will work with a more statistically sound
approach to removing outliers.

In [None]:
#Statistical outlier removal

#While removing the top N% of your data is useful for ensuring that
#very spurious points are removed, it does have the disadvantage of
#always removing the same proportion of points, even if the data is
#correct. A commonly used alternative approach is to remove data that
#sits further than three standard deviations from the mean. You can
#implement this by first calculating the mean and standard deviation
#of the relevant column to find upper and lower bounds, and applying
#these bounds as a mask to the DataFrame. This method ensures that
#only data that is genuinely different from the rest is removed, and
#will remove fewer points if the data is close together.

![_images/16.3.svg](_images/16.3.svg)

In [None]:
# Find the mean and standard dev
#std = so_numeric_df['ConvertedSalary'].std()
#mean = so_numeric_df['ConvertedSalary'].mean()

# Calculate the cutoff
#cut_off = std * 3
#lower, upper = mean - cut_off, mean + cut_off

# Trim the outliers
#trimmed_df = so_numeric_df[(so_numeric_df['ConvertedSalary'] < upper) \
#                           & (so_numeric_df['ConvertedSalary'] > lower)]

# The trimmed box plot
#trimmed_df[['ConvertedSalary']].boxplot()
#plt.show()

![_images/16.8.svg](_images/16.8.svg)
Did you notice the scale change on the y-axis?
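
The trade-off between the two approaches shows up clearly on data that has no real outliers. A minimal sketch, assuming synthetic normally distributed values in a made-up `clean` Series (not the survey data):

```python
import numpy as np
import pandas as pd

# Hypothetical well-behaved data with no genuine outliers
rng = np.random.default_rng(42)
clean = pd.Series(rng.normal(loc=50, scale=5, size=10_000))

# Quantile-based trimming: always drops the top 5%
q95 = clean.quantile(0.95)
kept_quantile = clean[clean < q95]

# Standard-deviation-based trimming: drops only points beyond 3 std
mean, std = clean.mean(), clean.std()
kept_sigma = clean[(clean > mean - 3 * std) & (clean < mean + 3 * std)]

removed_quantile = 1 - len(kept_quantile) / len(clean)
removed_sigma = 1 - len(kept_sigma) / len(clean)
print(f"quantile trim removed {removed_quantile:.2%}")
print(f"3-sigma trim removed  {removed_sigma:.2%}")
```

The quantile trim removes about 5% of the rows regardless of the data, while the three-standard-deviation rule removes only a small fraction of a percent here.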

**Scaling and transforming new data**
___
- you fit and transform the training data, but only transform test data
- remove outliers from the training set, but generally not from the test set
- Why only use training data?
    - **data leakage** - using data that you won't have access to when assessing the performance of your model
___
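
A minimal sketch of the fit-on-train, transform-on-test pattern, assuming a made-up `Age` column and using `train_test_split` for the split (the course's own split is not shown, so this is only an illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for so_numeric_df
rng = np.random.default_rng(1)
df = pd.DataFrame({'Age': rng.integers(18, 70, size=200)})

train, test = train_test_split(df, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

# Fit on the training set only, then reuse the fitted scaler on the test set
scaler = StandardScaler()
train['Age_ss'] = scaler.fit_transform(train[['Age']]).ravel()
test['Age_ss'] = scaler.transform(test[['Age']]).ravel()

# The scaler's learned mean comes from the training rows alone
print(scaler.mean_[0], train['Age'].mean())
```

Because the test rows never influence `scaler.mean_` or `scaler.scale_`, no information from the test set leaks into the preprocessing.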

In [None]:
#Train and testing transformations (I)

#So far you have created scalers based on a column, and then applied
#the scaler to the same data that it was trained on. When creating
#machine learning models you will generally build your models on
#historic data (train set) and apply your model to new unseen data
#(test set). In these cases you will need to ensure that the same
#scaling is being applied to both the training and test data.

#To do this in practice you train the scaler on the train set, and
#keep the trained scaler to apply it to the test set. You should
#never retrain a scaler on the test set.

#For this exercise and the next, we split the so_numeric_df
#DataFrame into train (so_train_numeric) and test (so_test_numeric)
#sets.

# Import StandardScaler
#from sklearn.preprocessing import StandardScaler

# Apply a standard scaler to the data
#SS_scaler = StandardScaler()

# Fit the standard scaler to the data
#SS_scaler.fit(so_train_numeric[['Age']])

# Transform the test data using the fitted scaler
#so_test_numeric['Age_ss'] = SS_scaler.transform(so_test_numeric[['Age']])
#print(so_test_numeric[['Age', 'Age_ss']].head())

#################################################
#       Age    Age_ss
#    700   35 -0.069265
#    701   18 -1.343218
#    702   47  0.829997
#    703   57  1.579381
#    704   41  0.380366
#################################################
#Data leakage is one of the most common mistakes data scientists tend
#to make, and I hope that you won't!

In [None]:
#Train and testing transformations (II)

#Similar to applying the same scaler to both your training and test
#sets, if you have removed outliers from the train set, you probably
#want to do the same on the test set as well. Once again you should
#ensure that you use the thresholds calculated only from the train
#set to remove outliers from the test set.

#Similar to the last exercise, we split the so_numeric_df DataFrame
#into train (so_train_numeric) and test (so_test_numeric) sets.

#train_std = so_train_numeric['ConvertedSalary'].std()
#train_mean = so_train_numeric['ConvertedSalary'].mean()

#cut_off = train_std * 3
#train_lower, train_upper = train_mean - cut_off, train_mean + cut_off

# Trim the test DataFrame
#trimmed_df = so_test_numeric[(so_test_numeric['ConvertedSalary'] < train_upper) \
#                             & (so_test_numeric['ConvertedSalary'] > train_lower)]

#################################################
#In the next chapter, you will deal with unstructured (text) data.
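
To see why the thresholds should come from the train set, here is a minimal sketch with made-up salary values (in thousands); `train` and `test` are hypothetical Series, not the course's splits:

```python
import pandas as pd

# Hypothetical train and test values; the test set contains one extreme point
train = pd.Series([40, 45, 50, 55, 60, 52, 48, 47, 53, 50])
test = pd.Series([49, 51, 300])

# Thresholds computed from the training set only
train_mean, train_std = train.mean(), train.std()
train_lower = train_mean - 3 * train_std
train_upper = train_mean + 3 * train_std
test_trimmed = test[(test > train_lower) & (test < train_upper)]

# For contrast: thresholds computed (incorrectly) from the test set itself
test_upper = test.mean() + 3 * test.std()

print(list(test_trimmed))
print(300 < test_upper)
```

With train-based bounds the extreme value is dropped. Bounds computed from the tiny test set are stretched by that same value, so it would survive the filter.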

**Encoding text**
___
- Standardizing your text
    - remove unwanted characters using str.replace() and regular expressions
        - [a-zA-Z]: all letter characters
        - [^a-zA-Z]: all non letter characters
- Standardize the case
    - str.lower()
- length of text
    - str.len()
- word count
- average length of word

In [None]:
#Cleaning up your text

#Unstructured text data cannot be directly used in most analyses.
#Multiple steps need to be taken to go from a long free form string
#to a set of numeric columns in the right format that can be
#ingested by a machine learning model. The first step of this process
#is to standardize the data and eliminate any characters that could
#cause problems later on in your analytic pipeline.

#In this chapter you will be working with a new dataset containing
#the inaugural speeches of the presidents of the United States
#loaded as speech_df, with the speeches stored in the text column.

# Print the first 5 rows of the text column
#print(speech_df['text'].head())

#################################################
#<script.py> output:
#    0    Fellow-Citizens of the Senate and of the House...
#    1    Fellow Citizens:  I AM again called upon by th...
#    2    WHEN it was first perceived, in early times, t...
#    3    Friends and Fellow-Citizens:  CALLED upon to u...
#    4    PROCEEDING, fellow-citizens, to that qualifica...
#    Name: text, dtype: object
#################################################

# Replace all non-letter characters with whitespace
# (regex=True is needed in pandas 2.0+, where str.replace matches literally by default)
#speech_df['text_clean'] = speech_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)

# Change to lower case
#speech_df['text_clean'] = speech_df['text_clean'].str.lower()

# Print the first 5 rows of the text_clean column
#print(speech_df['text_clean'].head())

#################################################
#<script.py> output:
#    0    fellow citizens of the senate and of the house...
#    1    fellow citizens   i am again called upon by th...
#    2    when it was first perceived  in early times  t...
#    3    friends and fellow citizens   called upon to u...
#    4    proceeding  fellow citizens  to that qualifica...
#    Name: text_clean, dtype: object
#################################################
#Now your text strings have been standardized and cleaned up. You
#can use this new column (text_clean) to extract information
#about the speeches.

In [None]:
#High level text features

#Once the text has been cleaned and standardized you can begin
#creating features from the data. The most fundamental information
#you can calculate about free form text is its size, such as its
#length and number of words. In this exercise (and the rest of this
#chapter), you will focus on the cleaned/transformed text column
#(text_clean) you created in the last exercise.

# Find the length of each text
#speech_df['char_cnt'] = speech_df['text_clean'].str.len()

# Count the number of words in each text
#speech_df['word_cnt'] = speech_df['text_clean'].str.split().str.len()

# Find the average length of word
#speech_df['avg_word_length'] = speech_df['char_cnt'] / speech_df['word_cnt']

# Print the first 5 rows of these columns
#print(speech_df[['text_clean', 'char_cnt', 'word_cnt', 'avg_word_length']].head())

#################################################
#<script.py> output:
#                                               text_clean  char_cnt  word_cnt  avg_word_length
#    0   fellow citizens of the senate and of the house...      8616      1432         6.016760
#    1   fellow citizens   i am again called upon by th...       787       135         5.829630
#    2   when it was first perceived  in early times  t...     13871      2323         5.971158
#    3   friends and fellow citizens   called upon to u...     10144      1736         5.843318
#    4   proceeding  fellow citizens  to that qualifica...     12902      2169         5.948363
#################################################
#These features may appear basic but can be quite useful in ML models.

**Word counts**
___
- text to columns
    - one column per word with word counts for each word
    - CountVectorizer in sklearn.feature_extraction.text
        - min_df, max_df = 0.1, 0.9
        - creates sparse array, convert to array using .toarray()
            - to get feature names from the vectorizer .get_feature_names() (renamed get_feature_names_out() in scikit-learn 1.0; the old name was removed in 1.2)
        - combine into dataframe
            - pd.DataFrame(cv_transformed.toarray(), columns=cv.get_feature_names()).add_prefix('Counts_')
        - updating/combining your dataframe
            - pd.concat([list], axis=1, sort=False)
___

In [None]:
#Counting words (I)

#Once high level information has been recorded you can begin creating
#features based on the actual content of each text. One way to do
#this is to approach it in a similar way to how you worked with
#categorical variables in the earlier lessons.

#For each unique word in the dataset a column is created.

#For each entry, the number of times this word occurs is counted and
#the count value is entered into the respective column.

#These "count" columns can then be used to train machine learning
#models.

# Import CountVectorizer
#from sklearn.feature_extraction.text import CountVectorizer

# Instantiate CountVectorizer
#cv = CountVectorizer()

# Fit the vectorizer
#cv.fit(speech_df['text_clean'])

# Print feature names
#print(cv.get_feature_names())

#################################################
#<script.py> output:
#    ['abandon', 'abandoned', 'abandonment', 'abate', 'abdicated', 'abeyance', 'abhorring', 'abide', 'abiding', 'abilities', 'ability', 'abject', 'able', 'ably', 'abnormal', 'abode', 'abolish', 'abolished', 'abolishing', 'aboriginal', 'aborigines', 'abound', 'abounding', 'abounds', 'about', 'above', 'abraham', 'abreast', 'abridging',
#################################################
#This vectorizer can be applied to both the text it was trained on, and new texts.

In [None]:
#Counting words (II)

#Once the vectorizer has been fit to the data, it can be used to
#transform the text to an array representing the word counts. This
#array will have a row per block of text and a column for each of
#the features generated by the vectorizer that you observed in the
#last exercise.

#The vectorizer you fit in the last exercise (cv) is available
#in your workspace.

# Apply the vectorizer
#cv_transformed = cv.transform(speech_df['text_clean'])

# Print the full array
#cv_array = cv_transformed.toarray()
#print(cv_array)

#################################################
#<script.py> output:
#    [[0 0 0 ... 0 0 0]
#     [0 0 0 ... 0 0 0]
#     [0 1 0 ... 0 0 0]
#     ...
#     [0 1 0 ... 0 0 0]
#     [0 0 0 ... 0 0 0]
#     [0 0 0 ... 0 0 0]]
#################################################

# Print the shape of cv_array
#print(cv_array.shape)

#################################################
#<script.py> output:
#    (58, 9043)
#################################################
#The speeches have 9043 unique words, which is a lot! In the next
#exercise, you will see how to create a limited set of features.

In [None]:
#Limiting your features

#As you have seen, using the CountVectorizer with its default
#settings creates a feature for every single word in your corpus.
#This can create far too many features, often including ones that
#will provide very little analytical value.

#For this purpose CountVectorizer has parameters that you can set to
#reduce the number of features:

#min_df : Use only words that occur in at least this proportion of documents.
#This can be used to remove outlier words that will not generalize across texts.

#max_df : Use only words that occur in at most this proportion of documents.
#This is useful to eliminate very common words that occur in almost every
#document without adding value, such as "and" or "the".

# Import CountVectorizer
#from sklearn.feature_extraction.text import CountVectorizer

# Specify arguments to limit the number of features generated
#cv = CountVectorizer(min_df=0.2, max_df=0.8)

# Fit, transform, and convert into array
#cv_transformed = cv.fit_transform(speech_df['text_clean'])
#cv_array = cv_transformed.toarray()

# Print the array shape
#print(cv_array.shape)

#################################################
#<script.py> output:
#    (58, 818)
#################################################
# The number of features (unique words) was greatly reduced, from 9043 to 818.

In [None]:
#Text to DataFrame
#Now that you have generated these count based features in an array
#you will need to reformat them so that they can be combined with
#the rest of the dataset. This can be achieved by converting the
#array into a pandas DataFrame, with the feature names you found
#earlier as the column names, and then concatenating it with the
#original DataFrame.

#The numpy array (cv_array) and the vectorizer (cv) you fit in the
#last exercise are available in your workspace.

# Create a DataFrame with these features
#cv_df = pd.DataFrame(cv_array,
#                     columns=cv.get_feature_names()).add_prefix('Counts_')

# Add the new columns to the original DataFrame
#speech_df_new = pd.concat([speech_df, cv_df], axis=1, sort=False)
#print(speech_df_new.head())

#################################################
#<script.py> output:
#                    Name         Inaugural Address                      Date                                               text                                         text_clean  ...  Counts_years  \
#    0  George Washington   First Inaugural Address  Thursday, April 30, 1789  Fellow-Citizens of the Senate and of the House...  fellow citizens of the senate and of the house...  ...             1
#    1  George Washington  Second Inaugural Address     Monday, March 4, 1793  Fellow Citizens:  I AM again called upon by th...  fellow citizens   i am again called upon by th...  ...             0
#    2         John Adams         Inaugural Address   Saturday, March 4, 1797  WHEN it was first perceived, in early times, t...  when it was first perceived  in early times  t...  ...             3
#    3   Thomas Jefferson   First Inaugural Address  Wednesday, March 4, 1801  Friends and Fellow-Citizens:  CALLED upon to u...  friends and fellow citizens   called upon to u...  ...             0
#    4   Thomas Jefferson  Second Inaugural Address     Monday, March 4, 1805  PROCEEDING, fellow-citizens, to that qualifica...  proceeding  fellow citizens  to that qualifica...  ...             2
#
#       Counts_yet  Counts_you  Counts_young  Counts_your
#    0           0           5             0            9
#    1           0           0             0            1
#    2           0           0             0            1
#    3           2           7             0            7
#    4           2           4             0            4
#
#    [5 rows x 826 columns]
#################################################
#With the new features combined with the original DataFrame they can
#now be used for ML models or analysis.

**Term frequency-inverse document frequency**
___
- TF-IDF =
    - count of word occurrences / Total words in document
        - Multiplied by
    - log (Total number of documents / Number of docs word is in)
- reduces the weight of common words and increases weights of uncommon words
- from sklearn.feature_extraction.text import TfidfVectorizer
    - max_features - maximum number of columns
    - stop_words - list of common words to omit
___
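
In its textbook form, the score is the term frequency multiplied by log(total documents / documents containing the word). A minimal pure-Python sketch on a made-up three-document corpus; the helper names `tf`, `idf` and `tf_idf` are hypothetical:

```python
import math

# Hypothetical tokenized corpus
docs = [
    "the union of the people".split(),
    "the states of the union".split(),
    "the people and commerce".split(),
]

def tf(word, doc):
    # Share of the document's words that are this word
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log(total documents / documents containing the word)
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf_idf("the", docs[0], docs))       # appears in every document
print(tf_idf("commerce", docs[2], docs))  # appears in one document only
```

A word found in every document gets an inverse document frequency of log(1) = 0, so its score vanishes, while a rare word keeps a positive weight. `TfidfVectorizer` follows the same idea but, by default, uses a smoothed inverse document frequency and L2-normalizes each row, so its numbers will differ from this sketch.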

In [None]:
#Tf-idf

#While counts of occurrences of words can be useful to build models,
#words that occur many times may skew the results undesirably. To
#limit these common words from overpowering your model a form of
#normalization can be used. In this lesson you will be using Term
#frequency-inverse document frequency (Tf-idf) as was discussed in
#the video. Tf-idf has the effect of reducing the value of common
#words, while increasing the weight of words that do not occur in
#many documents.

# Import TfidfVectorizer
#from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate TfidfVectorizer
#tv = TfidfVectorizer(max_features=100, stop_words='english')

# Fit the vectorizer and transform the data
#tv_transformed = tv.fit_transform(speech_df['text_clean'])

# Create a DataFrame with these features
#tv_df = pd.DataFrame(tv_transformed.toarray(),
#                     columns=tv.get_feature_names()).add_prefix('TFIDF_')
#print(tv_df.head())

#################################################
#<script.py> output:
#       TFIDF_action  TFIDF_administration  TFIDF_america  TFIDF_american  TFIDF_americans  ...  TFIDF_war  TFIDF_way  TFIDF_work  TFIDF_world  TFIDF_years
#    0      0.000000              0.133415       0.000000        0.105388              0.0  ...   0.000000   0.060755    0.000000     0.045929     0.052694
#    1      0.000000              0.261016       0.266097        0.000000              0.0  ...   0.000000   0.000000    0.000000     0.000000     0.000000
#    2      0.000000              0.092436       0.157058        0.073018              0.0  ...   0.024339   0.000000    0.000000     0.063643     0.073018
#    3      0.000000              0.092693       0.000000        0.000000              0.0  ...   0.036610   0.000000    0.039277     0.095729     0.000000
#    4      0.041334              0.039761       0.000000        0.031408              0.0  ...   0.094225   0.000000    0.000000     0.054752     0.062817
#
#    [5 rows x 100 columns]
#################################################
#Did you notice that counting the word occurrences and calculating
#the Tf-idf weights are very similar? This consistent API is one of
#the reasons scikit-learn is so popular.

In [None]:
#Inspecting Tf-idf values

#After creating Tf-idf features you will often want to understand
#which words score highest in each document. This can be
#achieved by isolating the row you want to examine and then sorting
#the scores from high to low.

#The DataFrame from the last exercise (tv_df) is available in your
#workspace.

# Isolate the row to be examined
#sample_row = tv_df.iloc[0]

# Print the top 5 words of the sorted output
#print(sample_row.sort_values(ascending=False).head())

#################################################
#<script.py> output:
#    TFIDF_government    0.367430
#    TFIDF_public        0.333237
#    TFIDF_present       0.315182
#    TFIDF_duty          0.238637
#    TFIDF_citizens      0.229644
#    Name: 0, dtype: float64
#################################################
#Do you think these scores make sense for the corresponding words?

In [None]:
#Transforming unseen data

#When creating vectors from text, any transformations that you
#perform before training a machine learning model, you also need to
#apply on the new unseen (test) data. To achieve this follow the
#same approach from the last chapter: fit the vectorizer only on the
#training data, and apply it to the test data.

#For this exercise the speech_df DataFrame has been split in two:

#train_speech_df: The training set consisting of the first 45 speeches.
#test_speech_df: The test set consisting of the remaining speeches.

# Instantiate TfidfVectorizer
#tv = TfidfVectorizer(max_features=100, stop_words='english')

# Fit the vectorizer and transform the data
#tv_transformed = tv.fit_transform(train_speech_df['text_clean'])

# Transform test data
#test_tv_transformed = tv.transform(test_speech_df['text_clean'])

# Create new features for the test set
#test_tv_df = pd.DataFrame(test_tv_transformed.toarray(),
#                          columns=tv.get_feature_names()).add_prefix('TFIDF_')
#print(test_tv_df.head())

#################################################
#<script.py> output:
#       TFIDF_action  TFIDF_administration  TFIDF_america  TFIDF_american  TFIDF_authority  ...  TFIDF_war  TFIDF_way  TFIDF_work  TFIDF_world  TFIDF_years
#    0      0.000000              0.029540       0.233954        0.082703         0.000000  ...   0.079050   0.033313    0.000000     0.299983     0.134749
#    1      0.000000              0.000000       0.547457        0.036862         0.000000  ...   0.052851   0.066817    0.078999     0.277701     0.126126
#    2      0.000000              0.000000       0.126987        0.134669         0.000000  ...   0.042907   0.054245    0.096203     0.225452     0.043884
#    3      0.037094              0.067428       0.267012        0.031463         0.039990  ...   0.030073   0.038020    0.235998     0.237026     0.061516
#    4      0.000000              0.000000       0.221561        0.156644         0.028442  ...   0.021389   0.081124    0.119894     0.299701     0.153133
#
#    [5 rows x 100 columns]
#################################################
#The vectorizer should only be fit on the train set, never on your
#test set.
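
One consequence of fitting only on the train set is that words first seen in the test set have no column and are silently ignored. A minimal sketch with made-up `train_texts` and `test_texts` lists:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical train and test documents
train_texts = ["liberty and union", "union of states"]
test_texts = ["liberty and commerce"]  # 'commerce' never appears in training

tv = TfidfVectorizer()
train_X = tv.fit_transform(train_texts)  # learns vocabulary and idf weights
test_X = tv.transform(test_texts)        # reuses them unchanged

print(sorted(tv.vocabulary_))
print(train_X.shape, test_X.shape)
```

Both matrices have the same columns, which is what a downstream model needs; the unseen word simply contributes nothing to the test row.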

**N-grams**
___
- Bag of words
    - words viewed/analyzed independently
    - valence (positive/negative) is ignored
- ngram_range
    - argument in TfidfVectorizer (and CountVectorizer)
    - indicates bigrams, trigrams, etc for more context to be considered
___
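
What an n-gram is can be shown without any library. A minimal pure-Python sketch, where `ngrams` is a hypothetical helper and the sentence is made up:

```python
def ngrams(tokens, n):
    # Slide a window of n consecutive tokens across the list
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "preserve protect and defend the constitution".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A text of k words yields k - n + 1 n-grams. The vectorizers build these windows for you when you pass `ngram_range`; they also apply their own tokenization and stop-word removal first, so their output can differ from this simple split.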

In [None]:
#Using longer n-grams

#So far you have created features based on individual words in each
#of the texts. This can be quite powerful when used in a machine
#learning model but you may be concerned that by looking at words
#individually a lot of the context is being ignored. To deal with
#this when creating models you can use n-grams which are sequences
#of n words grouped together. For example:

#bigrams: Sequences of two consecutive words
#trigrams: Sequences of three consecutive words

#These can be automatically created in your dataset by specifying
#the ngram_range argument as a tuple (n1, n2) where all n-grams in
#the n1 to n2 range are included.

# Import CountVectorizer
#from sklearn.feature_extraction.text import CountVectorizer

# Instantiate a trigram vectorizer
#cv_trigram_vec = CountVectorizer(max_features=100,
#                                 stop_words='english',
#                                 ngram_range = (3,3))

# Fit and apply trigram vectorizer
#cv_trigram = cv_trigram_vec.fit_transform(speech_df['text_clean'])

# Print the trigram features
#print(cv_trigram_vec.get_feature_names())

#################################################
#<script.py> output:
# ['ability preserve protect', 'agriculture commerce manufactures',
# 'america ideal freedom', 'amity mutual concession', 'anchor peace home',
# 'ask bow heads', 'best ability preserve', 'best interests country',
# 'bless god bless', 'bless united states', 'chief justice mr',
# 'children children children', 'citizens united states',
# 'civil religious liberty', 'civil service reform', 'commerce united states',
# 'confidence fellow citizens', 'congress extraordinary session',
# 'constitution does expressly', 'constitution united states',
# 'coordinate branches government', 'day task people',
# 'defend constitution united', 'distinction powers granted',
# 'distinguished guests fellow', 'does expressly say', 'equal exact justice',
# 'era good feeling', 'executive branch government', 'faithfully execute office',
# 'fellow citizens assembled', 'fellow citizens called', 'fellow citizens large',
# 'fellow citizens world', 'form perfect union', 'general welfare secure',
# 'god bless america', 'god bless god', 'good greatest number',
# 'government peace war', 'government united states',
# 'granted federal government', 'great body people', 'great political parties',
# 'greatest good greatest', 'guests fellow citizens', 'invasion wars powers',
# 'land new promise', 'laws faithfully executed', 'letter spirit constitution',
# 'liberty pursuit happiness', 'life liberty pursuit', 'local self government',
# 'make hard choices', 'men women children', 'mr chief justice',
# 'mr majority leader', 'mr president vice', 'mr speaker mr',
# 'mr vice president', 'nation like person', 'new breeze blowing',
# 'new states admitted', 'north south east', 'oath prescribed constitution',
# 'office president united', 'passed generation generation',
# 'peace shall strive', 'people united states', 'physical moral political',
# 'policy united states', 'power general government',
# 'preservation general government', 'preservation sacred liberty',
# 'preserve protect defend', 'president united states',
# 'president vice president', 'promote general welfare',
# 'proof confidence fellow', 'protect defend constitution',
# 'protection great interests', 'reform civil service',
# 'reserved states people', 'respect individual human',
# 'right self government', 'secure blessings liberty', 'south east west',
# 'sovereignty general government', 'states admitted union',
# 'territories united states', 'thank god bless', 'turning away old',
# 'united states america', 'united states best', 'united states government',
# 'united states great', 'united states maintain', 'united states territory',
# 'vice president mr', 'welfare secure blessings']
#################################################
#Here you can see that by taking sequential word triples, some
#context is preserved.

In [None]:
#Finding the most common words

#It's always advisable once you have created your features to inspect
#them to ensure that they are as you would expect. This will allow
#you to catch errors early, and perhaps influence what further
#feature engineering you will need to do.

#The vectorizer (cv_trigram_vec) you fit in the last exercise and the
#sparse array consisting of trigram counts (cv_trigram) are available
#in your workspace.

# Create a DataFrame of the features
#cv_tri_df = pd.DataFrame(cv_trigram.toarray(),
#                         columns=cv_trigram_vec.get_feature_names()).add_prefix('Counts_')

# Print the top 5 trigrams in the sorted output
#print(cv_tri_df.sum().sort_values(ascending=False).head())

#################################################
#<script.py> output:
#    Counts_constitution united states    20
#    Counts_people united states          13
#    Counts_preserve protect defend       10
#    Counts_mr chief justice              10
#    Counts_president united states        8
#    dtype: int64
#################################################
#That the most common trigram is "constitution united states" makes a
#lot of sense for US presidents' speeches.

**Wrap-up**
___
- Chapter 1
    - how to understand your data types
    - efficient encoding of categorical features
    - different ways to work with continuous variables
- Chapter 2
    - how to locate gaps in your data
    - best practices in dealing with incomplete rows
    - methods to find and deal with unwanted characters
- Chapter 3
    - how to observe your data's distribution
    - why and how to modify this distribution
    - best practices of finding outliers and their removal
- Chapter 4
    - the foundations of word embeddings
    - usage of Term Frequency Inverse Document Frequency (Tf-idf)
    - n-grams and their advantages over bag of words
- Next steps
    - Kaggle competitions
    - more DataCamp courses
    - your own project
___