**Why generate features?**
___
- Different types of data
    - continuous: integers or floats
    - categorical: one of a limited set of values
    - ordinal: ranked values
    - boolean: true/false values
    - datetime: dates and times
- Course structure
    - chapter 1: feature creation and extraction
    - chapter 2: engineering messy data
    - chapter 3: feature normalization
    - chapter 4: working with text features
___

In [None]:
#Getting to know your data

#Pandas is one of the most popular packages used to work with tabular
#data in Python. It is generally imported using the alias pd and can
#be used to load a CSV (or other delimited file) using read_csv().

#You will be working with a modified subset of the StackOverflow
#survey response data in the first three chapters of this course.
#This data set records the details and preferences of thousands of
#users of the StackOverflow website.

# Import pandas
#import pandas as pd

# Import so_survey_csv into so_survey_df
#so_survey_df = pd.read_csv(so_survey_csv)

# Print the first five rows of the DataFrame
#print(so_survey_df.head())

#################################################
#<script.py> output:
#          SurveyDate                                    FormalEducation  ConvertedSalary Hobby       Country  ...     VersionControl Age  Years Experience  Gender   RawSalary
#    0  2/28/18 20:20           Bachelor's degree (BA. BS. B.Eng.. etc.)              NaN   Yes  South Africa  ...                Git  21                13    Male         NaN
#    1  6/28/18 13:26           Bachelor's degree (BA. BS. B.Eng.. etc.)          70841.0   Yes       Sweeden  ...     Git;Subversion  38                 9    Male   70,841.00
#    2    6/6/18 3:37           Bachelor's degree (BA. BS. B.Eng.. etc.)              NaN    No       Sweeden  ...                Git  45                11     NaN         NaN
#    3    5/9/18 1:06  Some college/university study without earning ...          21426.0   Yes       Sweeden  ...  Zip file back-ups  46                12    Male   21,426.00
#    4  4/12/18 22:41           Bachelor's degree (BA. BS. B.Eng.. etc.)          41671.0   Yes            UK  ...                Git  39                 7    Male  £41,671.00
#
#    [5 rows x 11 columns]
#################################################

# Print the data type of each column
#print(so_survey_df.dtypes)

#################################################
#    SurveyDate                     object
#    FormalEducation                object
#    ConvertedSalary               float64
#    Hobby                          object
#    Country                        object
#    StackOverflowJobsRecommend    float64
#    VersionControl                 object
#    Age                             int64
#    Years Experience                int64
#    Gender                         object
#    RawSalary                      object
#    dtype: object
#################################################
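The loading and inspection steps above can be tried without the course file. This is a minimal sketch: the CSV text and the name `sample_df` are made up for illustration and are not the real survey data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the survey file: three rows, four columns.
csv_text = """SurveyDate,Country,Age,ConvertedSalary
2/28/18 20:20,South Africa,21,
6/28/18 13:26,Sweden,38,70841.0
4/12/18 22:41,UK,39,41671.0
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first rows (five by default);
# dtypes lists the inferred type of every column.
print(sample_df.head())
print(sample_df.dtypes)
```

The empty salary field in the first row is read as NaN, which is why that column is inferred as a float even though the other values look like whole numbers.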

In [None]:
#Selecting specific data types
#Often a data set will contain columns with several different data
#types (like the one you are working with). The majority of machine
#learning models require you to have a consistent data type across
#features. Similarly, most feature engineering techniques are
#applicable to only one type of data at a time. For these reasons
#among others, you will often want to be able to access just the
#columns of certain types when working with a DataFrame.

#The DataFrame (so_survey_df) from the previous exercise is available
#in your workspace.

# Create subset of only the numeric columns
#so_numeric_df = so_survey_df.select_dtypes(include=['int', 'float'])

# Print the column names contained in so_numeric_df
#print(so_numeric_df.columns)

#################################################
#<script.py> output:
#    Index(['ConvertedSalary', 'StackOverflowJobsRecommend', 'Age', 'Years Experience'], dtype='object')
#################################################
# In the next lesson, you will learn the most common ways of dealing
#with categorical data.
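As a small illustration of selecting columns by type, the sketch below uses a made-up DataFrame (`df` and its values are assumptions, not the survey data). It uses the `'number'` selector, which matches all integer and float dtypes, together with `exclude` for the complement.

```python
import pandas as pd

# Hypothetical toy data with mixed column types.
df = pd.DataFrame({
    "Country": ["UK", "USA", "Spain"],
    "Age": [39, 25, 31],
    "ConvertedSalary": [41671.0, 90000.0, None],
    "Hobby": ["Yes", "No", "Yes"],
})

# 'number' matches every integer and float dtype, whatever its bit width.
numeric_df = df.select_dtypes(include="number")

# exclude= gives the complement: every column that is not numeric.
non_numeric_df = df.select_dtypes(exclude="number")

print(list(numeric_df.columns))
print(list(non_numeric_df.columns))
```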

**Dealing with categorical features**
___
- encoding categorical features
    - one-hot encoding
        - converts n categories into n features
        - explainable features
        - problem of collinearity
    - dummy encoding
        - converts n categories into n-1 features
        - avoids collinearity: the omitted category is implied when all n-1 features are 0
___

In [None]:
#One-hot encoding and dummy variables

#To use categorical variables in a machine learning model, you first
#need to represent them in a quantitative way. The two most common
#approaches are to one-hot encode the variables or to use dummy
#variables. In this exercise, you will create both types of encoding
#and compare the created column sets. We will continue using the same
#DataFrame from the previous lesson, loaded as so_survey_df, focusing
#on its Country column.

# Convert the Country column to a one-hot encoded DataFrame
#one_hot_encoded = pd.get_dummies(so_survey_df, columns=['Country'], prefix='OH')

# Print the column names
#print(one_hot_encoded.columns)

#################################################
#<script.py> output:
#    Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby', 'StackOverflowJobsRecommend', 'VersionControl', 'Age', 'Years Experience', 'Gender', 'RawSalary', 'OH_France', 'OH_India',
#           'OH_Ireland', 'OH_Russia', 'OH_South Africa', 'OH_Spain', 'OH_Sweeden', 'OH_UK', 'OH_USA', 'OH_Ukraine'],
#          dtype='object')
#################################################

# Create dummy variables for the Country column
#dummy = pd.get_dummies(so_survey_df, columns=['Country'], drop_first=True, prefix='DM')

# Print the column names
#print(dummy.columns)

#################################################
#<script.py> output:
#    Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby', 'StackOverflowJobsRecommend', 'VersionControl', 'Age', 'Years Experience', 'Gender', 'RawSalary', 'DM_India', 'DM_Ireland',
#           'DM_Russia', 'DM_South Africa', 'DM_Spain', 'DM_Sweeden', 'DM_UK', 'DM_USA', 'DM_Ukraine'],
#          dtype='object')
#################################################
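#Did you notice that the column for France was missing when you
#created dummy variables? drop_first=True drops the first category
#(France here), which is implied whenever all the DM_ columns are 0.
#Now you can choose to use one-hot encoding or dummy variables where
#appropriate.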
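The difference between the two encodings is easiest to see on a tiny column. The following is a sketch on made-up data (`df` and its four rows are assumptions), using the same `pd.get_dummies` arguments as the exercise.

```python
import pandas as pd

# Hypothetical toy column with three distinct categories.
df = pd.DataFrame({"Country": ["France", "India", "UK", "India"]})

# One-hot encoding: one indicator column per category (3 categories -> 3 columns).
one_hot = pd.get_dummies(df, columns=["Country"], prefix="OH")

# Dummy encoding: drop_first=True removes the first category in sorted
# order (France), which is then represented by a row of all zeros.
dummy = pd.get_dummies(df, columns=["Country"], prefix="DM", drop_first=True)

print(list(one_hot.columns))
print(list(dummy.columns))
print(dummy.iloc[0])  # the France row: every DM_ column is 0/False
```

In the one-hot version every row has exactly one active indicator, so any one column can be derived from the others. That redundancy is the collinearity problem that dummy encoding avoids.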

In [None]:
#Dealing with uncommon categories

#Some features can have many different categories but a very uneven
#distribution of their occurrences. Take, for example, data
#scientists' favorite languages to code in: common choices are
#Python, R, and Julia, but some individuals will have bespoke
#choices such as FORTRAN or C. In these cases, you may not want to
#create a feature for each value, but only for the more common ones.

# Create a series out of the Country column (a copy, so that
# relabelling it later does not write back into so_survey_df)
#countries = so_survey_df['Country'].copy()

# Get the counts of each category
#country_counts = countries.value_counts()

# Print the count values for each category
#print(country_counts)

#################################################
#<script.py> output:
#    South Africa    166
#    USA             164
#    Spain           134
#    Sweeden         119
#    France          115
#    Russia           97
#    India            95
#    UK               95
#    Ukraine           9
#    Ireland           5
#    Name: Country, dtype: int64
#################################################

# Create a mask for only categories that occur fewer than 10 times
#mask = countries.isin(country_counts[country_counts < 10].index)

# Print the first 5 rows of the mask series
#print(mask.head())

#################################################
#<script.py> output:
#    0    False
#    1    False
#    2    False
#    3    False
#    4    False
#    Name: Country, dtype: bool
#################################################

# Label the infrequent categories selected by the mask as Other
#countries[mask] = 'Other'

# Print the updated category counts
#print(countries.value_counts())

#################################################
#<script.py> output:
#    South Africa    166
#    USA             164
#    Spain           134
#    Sweeden         119
#    France          115
#    Russia           97
#    India            95
#    UK               95
#    Other            14
#    Name: Country, dtype: int64
#################################################
#Now you can work with large data sets while grouping low-frequency
#categories.
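The same grouping can be written without modifying the original Series in place. This is a sketch on made-up data: the toy `countries` values and the threshold of 3 are assumptions chosen to keep the example small.

```python
import pandas as pd

# Hypothetical toy column: two common categories and two rare ones.
countries = pd.Series(
    ["USA"] * 5 + ["Spain"] * 4 + ["Ukraine", "Ireland"], name="Country"
)

# Count each category and collect the labels that fall below the threshold.
counts = countries.value_counts()
rare_labels = counts[counts < 3].index

# where() keeps a value when the condition is True and substitutes
# 'Other' otherwise; it returns a new Series and leaves countries as-is.
grouped = countries.where(~countries.isin(rare_labels), "Other")

print(grouped.value_counts())
```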

**Numeric variables**
___
- Types of numeric features
    - age
    - price
    - counts
    - geospatial data
___

In [None]:
#Binarizing columns

#While numeric values can often be used without any feature
#engineering, there will be cases when some form of manipulation
#can be useful. For example, on some occasions, you might not care
#about the magnitude of a value but only care about its direction,
#or if it exists at all. In these situations, you will want to
#binarize a column. In the so_survey_df data, you have a large
#number of survey respondents that are working voluntarily (without
#pay). You will create a new column titled Paid_Job indicating
#whether each person is paid (their salary is greater than zero).

# Create the Paid_Job column filled with zeros
#so_survey_df['Paid_Job'] = 0

# Replace all the Paid_Job values where ConvertedSalary is > 0
#so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1

# Print the first five rows of the columns
#print(so_survey_df[['Paid_Job', 'ConvertedSalary']].head())

#################################################
#<script.py> output:
#       Paid_Job  ConvertedSalary
#    0         0              0.0
#    1         1          70841.0
#    2         0              0.0
#    3         1          21426.0
#    4         1          41671.0
#################################################
#Binarizing columns can also be useful for your target variables.
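A compact alternative to the two-step approach above is to convert the boolean comparison directly. This is a sketch on a made-up salary column (the values are assumptions) that also shows how a missing salary is treated.

```python
import pandas as pd

# Hypothetical toy salaries, including one missing value.
df = pd.DataFrame({"ConvertedSalary": [0.0, 70841.0, None, 21426.0]})

# The comparison yields booleans. NaN > 0 evaluates to False,
# so a missing salary is mapped to 0 just like an unpaid one.
df["Paid_Job"] = (df["ConvertedSalary"] > 0).astype(int)

print(df)
```

If missing and unpaid respondents need to be told apart, the missingness has to be recorded separately, because this flag alone cannot distinguish them.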

In [None]:
#Binning values

#For many continuous values you will care less about the exact value
#of a numeric column and more about the bucket it falls into.
#This can be useful when plotting values or simplifying your machine
#learning models. It is mostly used on continuous variables where
#exact precision is not the biggest concern, e.g. age, height, wages.

#Bins are created using pd.cut(df['column_name'], bins) where bins can
#be an integer specifying the number of evenly spaced bins, or a list
#of bin boundaries.

# Bin the continuous variable ConvertedSalary into 5 bins
#so_survey_df['equal_binned'] = pd.cut(so_survey_df['ConvertedSalary'], 5)

# Print the first 5 rows of the equal_binned column
#print(so_survey_df[['equal_binned', 'ConvertedSalary']].head())

#################################################
#<script.py> output:
#              equal_binned  ConvertedSalary
#    0  (-2000.0, 400000.0]              0.0
#    1  (-2000.0, 400000.0]          70841.0
#    2  (-2000.0, 400000.0]              0.0
#    3  (-2000.0, 400000.0]          21426.0
#    4  (-2000.0, 400000.0]          41671.0
#################################################

# Import numpy
#import numpy as np

# Specify the boundaries of the bins
#bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]

# Bin labels
#labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# Bin the continuous variable ConvertedSalary using these boundaries
#so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'],
#                                         bins, labels = labels)

# Print the first 5 rows of the boundary_binned column
#print(so_survey_df[['boundary_binned', 'ConvertedSalary']].head())

#################################################
#<script.py> output:
#      boundary_binned  ConvertedSalary
#    0        Very low              0.0
#    1          Medium          70841.0
#    2        Very low              0.0
#    3             Low          21426.0
#    4             Low          41671.0
#################################################
#Now you can bin columns with equal spacing and predefined boundaries.
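To see how `pd.cut` decides which bin a value lands in, here is a sketch on a handful of made-up salaries (the values, the four-bin count and the labels are assumptions). It returns the computed edges for the equal-width case and shows that intervals are closed on the right by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy salaries.
salaries = pd.Series([0.0, 10000.0, 21426.0, 70841.0, 200000.0])

# Integer bins: equal-width intervals spanning min..max.
# retbins=True also returns the bin edges that were computed.
equal_binned, edges = pd.cut(salaries, 4, retbins=True)

# Explicit boundaries: intervals are right-closed by default,
# so a salary of exactly 10000 falls in (-inf, 10000].
bins = [-np.inf, 10000, 50000, 100000, np.inf]
labels = ["Very low", "Low", "Medium", "High"]
boundary_binned = pd.cut(salaries, bins, labels=labels)

print(edges)
print(boundary_binned.tolist())
```

With equal-width bins a single large value stretches the range, so most rows end up in the first bin, as in the exercise output above. Explicit boundaries avoid that at the cost of choosing the cut points yourself.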

**Why do missing values exist?**
___
- How gaps in data occur
    - data not being collected properly
    - collection and management errors
    - data intentionally being omitted
    - could be created due to transformations of the data
- Why do we care?
    - some models cannot work with missing data (Nulls/NaNs)
    - missing data may be a sign of a wider data issue
    - missing data can be a useful feature
- df.info()
- df.isnull()
- df.isnull().sum()
- df.notnull()
___

In [None]:
#How sparse is my data?
#Most data sets contain missing values, often represented as NaN
#(Not a Number). If you are working with Pandas you can easily check
#how many missing values exist in each column.

#Let's find out how many of the developers taking the survey chose to
#enter their age (found in the Age column of so_survey_df) and their
#gender (Gender column of so_survey_df).

# Subset the DataFrame
#sub_df = so_survey_df[['Age', 'Gender']]

# Print the number of non-missing values
#print(sub_df.info())

#################################################
#<script.py> output:
#    <class 'pandas.core.frame.DataFrame'>
#    RangeIndex: 999 entries, 0 to 998
#    Data columns (total 2 columns):
#    Age       999 non-null int64
#    Gender    693 non-null object
#    dtypes: int64(1), object(1)
#    memory usage: 15.7+ KB
#    None
#################################################
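Besides `info()`, the missing values can be counted directly. This is a sketch on a made-up four-row subset (the `sub_df` values here are assumptions, not the survey data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy subset with one missing Gender value.
sub_df = pd.DataFrame({
    "Age": [21, 38, 45, 46],
    "Gender": ["Male", "Male", np.nan, "Male"],
})

# isnull() gives a boolean frame; summing counts True as 1 per column.
missing_per_column = sub_df.isnull().sum()

# The mean of the same booleans is the fraction of missing values.
missing_share = sub_df.isnull().mean()

print(missing_per_column)
print(missing_share)
```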

In [None]:
#Finding the missing values

#While having a summary of how much of your data is missing can be
#useful, often you will need to find the exact locations of these
#missing values. Using the same subset of the StackOverflow data
#from the last exercise (sub_df), you will show how a value can be
#flagged as missing.

# Print the first 10 entries of the DataFrame
#print(sub_df.head(10))

#################################################
#<script.py> output:
#       Age  Gender
#    0   21    Male
#    1   38    Male
#    2   45     NaN
#    3   46    Male
#    4   39    Male
#    5   39    Male
#    6   34    Male
#    7   24  Female
#    8   23    Male
#    9   36     NaN
#################################################

# Print the locations of the missing values
#print(sub_df.head(10).isnull())

#################################################
#<script.py> output:
#         Age  Gender
#    0  False   False
#    1  False   False
#    2  False    True
#    3  False   False
#    4  False   False
#    5  False   False
#    6  False   False
#    7  False   False
#    8  False   False
#    9  False    True
#################################################

# Print the locations of the non-missing values
#print(sub_df.head(10).notnull())

#################################################
#<script.py> output:
#        Age  Gender
#    0  True    True
#    1  True    True
#    2  True   False
#    3  True    True
#    4  True    True
#    5  True    True
#    6  True    True
#    7  True    True
#    8  True    True
#    9  True   False
#################################################
# Finding where the missing values exist can often be important.
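The boolean output of `isnull()` can also be used as a filter to pull out the affected rows. This is a sketch on made-up data (the `sub_df` values are assumptions).

```python
import numpy as np
import pandas as pd

# Hypothetical toy subset with two missing Gender values.
sub_df = pd.DataFrame({
    "Age": [21, 38, 45, 46, 36],
    "Gender": ["Male", "Male", np.nan, "Male", np.nan],
})

# Boolean Series: True where Gender is missing.
gender_missing = sub_df["Gender"].isnull()

# Use it as a row filter, or to recover the index labels of the gaps.
rows_missing_gender = sub_df[gender_missing]
missing_index = sub_df.index[gender_missing].tolist()

print(rows_missing_gender)
print(missing_index)
```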

**Dealing with missing values (I)**
___
- df.dropna()
- df.drop()
- random omissions
    - complete case analysis / listwise deletion
    - drawbacks
        - deletes valid data points as well
        - relies on randomness
        - reduces information if a feature is removed (degrees of freedom)
- replacement
- recording missing values
___

In [None]:
#Listwise deletion

#The simplest way to deal with missing values in your dataset when
#they are occurring entirely at random is to remove those rows, also
#called 'listwise deletion'.

#Depending on the use case, you will sometimes want to remove all
#missing values in your data while other times you may want to only
#remove a particular column if too many values are missing in that
#column.

# Print the number of rows and columns
#print(so_survey_df.shape)

#################################################
#<script.py> output:
#    (999, 11)
#################################################

# Create a new DataFrame dropping all incomplete rows
#no_missing_values_rows = so_survey_df.dropna()

# Print the shape of the new DataFrame
#print(no_missing_values_rows.shape)

#################################################
#<script.py> output:
#    (264, 11)
#################################################

# Create a new DataFrame dropping all columns that contain missing values
#no_missing_values_cols = so_survey_df.dropna(how='any', axis=1)

# Print the shape of the new DataFrame
#print(no_missing_values_cols.shape)

#################################################
#<script.py> output:
#    (999, 7)
#################################################

# Drop all rows where Gender is missing
#no_gender = so_survey_df.dropna(subset=['Gender'])

# Print the shape of the new DataFrame
#print(no_gender.shape)

#################################################
#<script.py> output:
#    (693, 11)
#################################################
#As you can see, dropping all rows that contain any missing values
#may greatly reduce the size of your dataset, so you need to think
#carefully and consider several trade-offs when deleting missing
#values.
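The three deletion strategies can be compared side by side on a small frame. This is a sketch with made-up values (`df` is an assumption), showing how much each variant of `dropna` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Age is complete, the other columns have gaps.
df = pd.DataFrame({
    "Age": [21, 38, 45, 46],
    "Gender": ["Male", np.nan, "Female", np.nan],
    "ConvertedSalary": [np.nan, 70841.0, 21426.0, 41671.0],
})

# Listwise deletion: keep only rows with no missing value at all.
complete_rows = df.dropna()

# Column-wise deletion: keep only columns with no missing value.
complete_cols = df.dropna(axis=1)

# Targeted deletion: drop rows only when Gender is missing.
has_gender = df.dropna(subset=["Gender"])

print(df.shape, complete_rows.shape, complete_cols.shape, has_gender.shape)
```

Here only one of four rows survives listwise deletion even though no single column is more than half empty, which is the trade-off described above.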

In [None]:
#Replacing missing values with constants

#While removing missing data entirely may be a correct approach in
#many situations, this may result in a lot of information being
#omitted from your models.

#You may find categorical columns where the missing value is a valid
#piece of information in itself, such as someone refusing to answer
#a question in a survey. In these cases, you can fill all missing
#values with a new category entirely, for example 'No response given'.

# Print the count of occurrences
#print(so_survey_df['Gender'].value_counts())

#################################################
#<script.py> output:
#    Male                                                                         632
#    Female                                                                        53
#    Female;Male                                                                    2
#    Transgender                                                                    2
#    Female;Male;Transgender;Non-binary. genderqueer. or gender non-conforming      1
#    Female;Transgender                                                             1
#    Male;Non-binary. genderqueer. or gender non-conforming                         1
#    Non-binary. genderqueer. or gender non-conforming                              1
#    Name: Gender, dtype: int64
#################################################

# Replace missing values (assign the result back to the column)
#so_survey_df['Gender'] = so_survey_df['Gender'].fillna(value='Not Given')

# Print the count of each value
#print(so_survey_df['Gender'].value_counts())

#################################################
#<script.py> output:
#    Male                                                                         632
#    Not Given                                                                    306
#    Female                                                                        53
#    Female;Male                                                                    2
#    Transgender                                                                    2
#    Female;Male;Transgender;Non-binary. genderqueer. or gender non-conforming      1
#    Female;Transgender                                                             1
#    Male;Non-binary. genderqueer. or gender non-conforming                         1
#    Non-binary. genderqueer. or gender non-conforming                              1
#    Name: Gender, dtype: int64
#################################################
#By filling in these missing values you can use the columns in your 
#analyses.
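Filling with a constant can be combined with recording where the gaps were, so that information is not lost. This is a sketch on made-up data: `df` and the column name `Gender_missing` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing responses.
df = pd.DataFrame({"Gender": ["Male", np.nan, "Female", np.nan]})

# Record which responses were missing before they are overwritten.
df["Gender_missing"] = df["Gender"].isnull()

# fillna returns a new Series; assign it back to the column
# rather than relying on inplace=True on a column selection.
df["Gender"] = df["Gender"].fillna("Not Given")

print(df)
print(df["Gender"].value_counts())
```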