# Feature Engineering

# Remember to scale data in COV_weather project

#### Categorical, Text, and Image Features
* Data scientists regularly work with categorical, text, and image data. However, to execute machine learning algorithms on these data types, it's necessary to perform transformations first. 
* Categorical data, such as the neighborhood in which a property is located, does not always work well with the machine learning algorithm you're most interested in using. 
* Linear regression, for example, requires numerical inputs.
* Options include one-hot encoding of categorical data and text and image data feature engineering (important for processes like NLP, which has applications in social media and data mining).
* Feature engineering with images can be very complex; the simplest approach is to use the pixel values themselves as features
* HOG: Histogram of Oriented Gradients
   
Feature Engineering: understand how best to preprocess and engineer features from categorical, continuous, and unstructured data. 

* **Feature Engineering:** the act of taking raw data and extracting features for machine learning 
* Most machine learning algorithms work with tabular data.
* Most ML algorithms require their input data to be represented as a vector or a matrix, and many assume that the data is normally distributed

* **Different Types of Data:**
    * **Continuous:** either integers (whole numbers) or floats (decimal values)
    * **Categorical:** one of a limited set of values, e.g. gender, country of birth
    * **Ordinal:** ranked values, often with no detail of distance between them
    * **Boolean:** True/False values
    * **Datetime:** dates and times
* in pandas, "objects" are columns that contain strings
* knowing the types of each column can be very useful if you are performing analysis based on a subset of specific data types. To do this, use: `.select_dtypes()` method and pass a list of relevant data types: `only_ints = df.select_dtypes(include=['int'])`

#### Categorical Variables
* Categorical variables are used to represent groups that are qualitative in nature, like colors, country of birth
* You will need to encode categorical values as numeric values to use them in your machine learning models 
* When categories are unordered (like colors, country of birth), assigning ordered numerical values to them may greatly reduce the effectiveness of your model.
* Thus, you cannot allocate arbitrary numbers to each category, as that would imply some form of ordering to the categories
* $\Rightarrow$ **One Hot Encoding**
* $\Rightarrow$ **Dummy Encoding**
    * Very similar, and often confused
    * by default, pandas performs one hot encoding when you use the get_dummies() function
    * difference:
        * **One Hot Encoding:** converts *n* categories into *n* features
            * `pd.get_dummies(df, columns=['Country'], prefix ='C')`
            * note that specifying a prefix argument can improve readability, especially if the list of column names passed to `columns` contains more than one column.
            * **Use for: generally creating more explainable features**
            * **Note:** one-hot encoding may create features that are perfectly collinear, because the same information is represented multiple times.
        * **Dummy Encoding:** creates *n* - 1 features for *n* categories
            * `pd.get_dummies(df, columns=['Country'], drop_first=True, prefix='C')`
            * the dropped column (referred to as the *base column*) is encoded by the absence of all other features, and its value is represented by the intercept
            * **Use for: Necessary information without duplication.**
            
        * Both one-hot encoding and dummy encoding may result in a **huge** number of columns being created if there are too many different categories in a column 
        * In these cases, you may only want to create columns for the most common values:
            * `counts = df['Country'].value_counts()` # to check occurrences of each category value 
            * once you have the counts of category occurrences, you can use them to limit which values you include, by first creating a mask of the values that occur fewer than *n* times:
            * `mask = df['Country'].isin(counts[counts<5].index)`
            * use the mask to replace these categories that occur less frequently with a value of your choice (for example: an umbrella category like 'Other')
            * `df.loc[mask, 'Country'] = 'Other'`
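The encoding steps above can be sketched end to end on a toy frame. The data and the threshold of 2 are made up for illustration; only the `Country` column name comes from the notes.

```python
import pandas as pd

# Hypothetical toy data; 'Country' mirrors the column name used above.
df = pd.DataFrame({'Country': ['USA', 'USA', 'India', 'India', 'India', 'Chile', 'Peru']})

# Group categories seen fewer than 2 times under a single 'Other' label.
counts = df['Country'].value_counts()
mask = df['Country'].isin(counts[counts < 2].index)
df.loc[mask, 'Country'] = 'Other'

# One-hot encoding: one column per remaining category.
one_hot = pd.get_dummies(df, columns=['Country'], prefix='C')

# Dummy encoding: the first category (in sorted order) becomes the base column.
dummy = pd.get_dummies(df, columns=['Country'], prefix='C', drop_first=True)

print(list(one_hot.columns))
print(list(dummy.columns))
```

With three categories left after grouping, the one-hot frame has three indicator columns and the dummy frame has two.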

#### Numeric variables
* Even if your raw data is all numeric, there is still a lot you can do to improve your features 
* Types of numeric features:
    * Age
    * Price
    * Counts
    * Geospatial data (such as coordinates) 
* A few of the considerations and possible feature engineering steps to keep in mind when dealing with numeric data:

* **Is the magnitude of the feature its most important trait, or just its direction?**
    * Can you turn numeric values (for example, number of restaurant health code violations) into binary values (has restaurant ever violated a health code before? yes/no)
    * **Binarizing numeric data:**
    * #Create new column: Binary Violation
    * `df['Binary_Violation'] = 0`
    * `df.loc[df['Number_of_Violations'] > 0, 'Binary_Violation'] = 1`
    
    * **Binning numeric variables:** 
    * Similar to binarizing, but using more than just 2 bins
    * Often useful for variables such as age brackets, income brackets, etc where exact numbers are less relevant than general magnitude of the value 
    * `df['Binned_Group'] = pd.cut(df['Number_of_Violations'], bins=[-np.inf, 0, 2, np.inf], labels =[1,2,3])`
    * note in the above code: the `bins` argument lists the cut-off points, so 3 bins need 4 values
    * bins created using `pd.cut()`
    * **Note:** A new column can be created using `df[column_name] = default_value`

```
# Create the Paid_Job column filled with zeros
so_survey_df['Paid_Job'] = 0

# Replace all the Paid_Job values where ConvertedSalary is > 0
so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1

# Print the first five rows of the columns
print(so_survey_df[['Paid_Job', 'ConvertedSalary']].head())
```
    
* Bins are created using pd.cut(df['column_name'], bins) where bins can be an integer specifying the number of evenly spaced bins, or a list of bin boundaries.

```
# Import numpy
import numpy as np

# Specify the boundaries of the bins
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]

# Bin labels
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# Bin the continuous variable ConvertedSalary using these boundaries
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'], 
                                         bins=bins, labels=labels)

# Print the first 5 rows of the boundary_binned column
print(so_survey_df[['boundary_binned', 'ConvertedSalary']].head())
```

## Dealing with messy data and missing values 

* Real-world data often has noise and omissions. 
* Many machine learning models cannot work with missing values 
    * for example, if you are performing linear regression, you need a value for every row and column used in your dataset
* Missing data may also be a sign of a wider data issue
* Missing data may also provide data in and of itself and can sometimes be a useful feature 
* Use `.info()` method to have a preliminary look at how complete the data set is
* To find where these missing values exist, `.isnull()`
* Find non-missing values with `.notnull()`

#### Dealing with missing values 
* If you are confident that the missing values in your dataset are occurring at random (and not being intentionally omitted), the most effective and statistically sound approach to dealing with them is called "complete case analysis" or "listwise deletion" 
    * In this method, a record is fully excluded from your model if any of its values are missing
    * **List-wise deletion in Python:**
    * To delete *all* rows with at least one NaN:
        * `df.dropna(how='any')`
    * On the other hand, if you want to delete rows with missing values in only a specific column, use `subset` argument:
        * `df.dropna(subset=['VersionControl'])`
        * pass a *list* of arguments/columns
        
* Drawbacks to list-wise deletion:
    * It deletes valid data points 
    * It relies on randomness
    * It reduces information (specifically, degrees of freedom)

* The most common way to deal with missing values is to replace them:
    * **`.fillna()`**: `df['VersionControl'] = df['VersionControl'].fillna('None Given')`
    * To use the fillna() method on a specific column, you need to provide the value you want to replace the missing values with
    * In the case of categorical columns, it is common to replace missing values with strings like "Other", "Not Given", etc.
    
* In situations where you believe that the absence or presence of data is more important than the values themselves, you can create a new column that records the absence of data (and then drop the original column). 

```
# Record whether a salary was provided, then drop the original column
df['SalaryGiven'] = df['ConvertedSalary'].notnull()
df = df.drop(columns=['ConvertedSalary'])
```

```
# Drop all rows where Gender (column) value is missing
no_gender = so_survey_df.dropna(subset = ['Gender'], axis=0)
```

#### Fill continuous missing values
* one of the major issues with list-wise deletion is apparent when you are building a predictive model: you can't delete rows with missing values in the test set.
* **Replacing missing values:** 
    * **Categorical columns:** replace missing values with the most commonly occurring value or with a string that flags missing values, such as 'None'.
    * **Numeric columns:** replace missing values with a "suitable value"
        * A "suitable value" would be a measure of central tendency like mean or median.
        * However, remember: this can lead to biased estimates of the variances and covariances of the features 
        * Similarly, the standard error and test statistics can be incorrectly estimated: So, if these metrics are needed, they should be calculated before the missing values have been filled
        * directly fill missing values:
            * `df['ConvertedSalary']=df['ConvertedSalary'].fillna(df['ConvertedSalary'].mean())`
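A minimal sketch of filling a numeric column when a test set is involved. The toy frames are hypothetical, and computing the fill value on the training split only (here the median) is an assumption about good practice rather than something the notes prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test splits; the column name follows the notes above.
train = pd.DataFrame({'ConvertedSalary': [40000.0, np.nan, 60000.0, 80000.0]})
test = pd.DataFrame({'ConvertedSalary': [np.nan, 55000.0]})

# Record missingness first, since it can be a useful feature in itself.
train['SalaryGiven'] = train['ConvertedSalary'].notnull()
test['SalaryGiven'] = test['ConvertedSalary'].notnull()

# Compute the fill value on the training data only...
train_median = train['ConvertedSalary'].median()

# ...and reuse it for both splits, so rows in the test set never have to be dropped.
train['ConvertedSalary'] = train['ConvertedSalary'].fillna(train_median)
test['ConvertedSalary'] = test['ConvertedSalary'].fillna(train_median)

print(train_median)
print(test)
```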

#### Dealing with other data issues:
* **Dealing with bad characters:**
    * `df['RawSalary'] = df['RawSalary'].str.replace(',','')`
    * `df['RawSalary'] = df['RawSalary'].astype('float')`
    * If attempting to change the data type results in an error, this may indicate that there are additional stray characters which you didn't account for.
    * Instead of manually searching for values with other stray characters, use the `pd.to_numeric()` function:
        * `coerced_vals = pd.to_numeric(df['RawSalary'], errors='coerce')`: pandas will convert the column to numeric, and every value that can't be converted is changed to NaN
        * You can now use `isna()` to inspect the rows that failed to convert:
            * `print(df[coerced_vals.isna()].head())`
* **Chaining methods:**
`so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace(',','').str.replace('dolsign','').str.replace('£', '').astype('float')`
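The clean-then-coerce workflow can be tried on a small, made-up column. The values below are hypothetical; `regex=False` is passed so that the characters are treated literally regardless of the pandas version.

```python
import pandas as pd

# Hypothetical raw values containing separators, currency symbols, and a placeholder.
df = pd.DataFrame({'RawSalary': ['1,000', '2500', '£3,200', '$450', 'n/a']})

# Strip the characters we already know about.
cleaned = (df['RawSalary']
           .str.replace(',', '', regex=False)
           .str.replace('£', '', regex=False))

# Anything that still cannot be parsed becomes NaN instead of raising an error.
coerced_vals = pd.to_numeric(cleaned, errors='coerce')

# Show the original rows that failed to convert so the stray characters can be inspected.
print(df[coerced_vals.isna()])
```

Here the rows holding `'$450'` and `'n/a'` are the ones that surface, which points to the characters still left to handle.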
                


#### Data distributions
* An important consideration before building a machine learning model is to understand what the distribution of your underlying data looks like.
* A lot of algorithms make assumptions about how your data is distributed or how different features interact with each other 
* For example, many models (tree-based models being a notable exception) work best when your features are on the same scale
* Feature engineering can be used to manipulate your data so that it can fit the assumptions of the distribution or at least fit it as closely as possible 
* Many models (again, tree-based models excepted) assume that your data is approximately normally distributed. For a normal distribution:
    * 68% of the data lies within 1 standard deviation of the mean
    * 95% lies within 2 standard deviations of the mean
    * 99.7% of the data lies within 3 standard deviations of the mean
* To understand the shape of your own data, you can create histograms of each of the continuous features: `df.hist()`
* To create boxplots: `df[['column1']].boxplot()`
* Pairing distributions: `sns.pairplot(df)`
* Summary statistics: `df.describe()`
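As a quick numerical check of the 68 / 95 / 99.7 rule, the sketch below draws a synthetic normal sample (the mean, standard deviation, and seed are arbitrary choices) and measures the share of points within 1, 2, and 3 standard deviations.

```python
import numpy as np

# Synthetic, normally distributed sample; parameters are arbitrary.
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=100_000)

# Distance of each point from the mean, in units of standard deviations.
z = np.abs((sample - sample.mean()) / sample.std())

# Share of points within 1, 2, and 3 standard deviations of the mean.
within = {k: float((z <= k).mean()) for k in (1, 2, 3)}
print(within)
```

Running the same check on a real feature gives a rough sense of how far it departs from normality.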

#### Scalers and transformations
* Most ML models and algorithms require your data to be on the same scale for them to be effective
* There are many different approaches to scaling data, but the most popular are: **Standardization** and **Min-Max Scaling** (aka **Normalization**)

* **Min-Max Scaling** is when your data is scaled linearly between a minimum and maximum value, often 0 and 1, with 0 corresponding with the lowest value in the column and 1 corresponding with the highest value in the column
    * As it is a linear scaling, while the values will change, the distribution will not
    * To implement min-max scaling in Python:
    
```
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df[['Age']])
df['normalized_age'] = scaler.transform(df[['Age']])
```
* **Note:** Since this scaler assumes the max value of your data to be the upper boundary of values, new data may produce unforeseen results.
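The caveat about new data can be seen with a small sketch. The ages are invented; the scaler is fitted on one frame and then applied to values outside the range it saw.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: the scaler only ever sees ages between 20 and 60.
train = pd.DataFrame({'Age': [20, 30, 40, 60]})
new = pd.DataFrame({'Age': [10, 80]})

scaler = MinMaxScaler()
scaler.fit(train[['Age']])

train_scaled = scaler.transform(train[['Age']]).ravel()
new_scaled = scaler.transform(new[['Age']]).ravel()

print(train_scaled)  # stays within [0, 1]
print(new_scaled)    # falls outside [0, 1]
```

Values below the fitted minimum come out negative and values above the fitted maximum exceed 1, which is the "unforeseen result" to watch for.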

* **Standardization:** As opposed to finding an outer boundary and squeezing everything within it, Standardization instead finds the mean of your data and centers your distribution around it, calculating the number of stds away from the mean each point is. These values (the number of standard deviations) are then used as your new values. This centers your data around zero but technically has no limit to the maximum and minimum values 
    * Standardization in Python:

```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df[['Age']])
df['standardized_col'] = scaler.transform(df[['Age']])
```

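* Both normalization and standardization are types of scalers (i.e. the shape of the distribution stays the same)
* A **log transformation**, on the other hand, can be used to make highly skewed distributions less skewed (for example if your data has a long tail)
    * Log-like transformation in Python using `PowerTransformer`, which by default applies a Yeo-Johnson power transform (not a plain logarithm) and standardizes the result:

```
from sklearn.preprocessing import PowerTransformer
# Yeo-Johnson power transform by default; it reduces skew much like a log transform
log = PowerTransformer()
log.fit(df[['ConvertedSalary']])
df['log_ConvertedSalary'] = log.transform(df[['ConvertedSalary']])
```
* For a plain log transform of non-negative values, NumPy's `np.log1p()` (the log of 1 + x) is a common alternative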
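To illustrate the effect on skew, here is a small sketch with a plain log transform. The salaries are synthetic log-normal draws, not real survey data, and `np.log1p` is used as an assumption so that zero values stay defined.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed salaries (log-normal draws).
rng = np.random.default_rng(42)
df = pd.DataFrame({'ConvertedSalary': rng.lognormal(mean=10, sigma=1, size=5000)})

# log1p computes log(1 + x), so a salary of zero would map to zero.
df['log_ConvertedSalary'] = np.log1p(df['ConvertedSalary'])

# Skewness near 0 indicates a roughly symmetric distribution.
skew_before = df['ConvertedSalary'].skew()
skew_after = df['log_ConvertedSalary'].skew()
print(skew_before, skew_after)
```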



## Text mining in Python
* Text Mining is the process of deriving meaningful information from natural language text.
* The overall goal is to turn the texts into data for analysis, via application of Natural Language Processing.

## Natural Language Processing (NLP)
* NLP is a part of computer science and artificial intelligence which deals with human languages
* In other words, NLP is a component of text mining that performs a special kind of linguistic analysis that essentially helps a machine "read" text.
* It uses a different methodology to decipher the ambiguities in human language, including:
    * automatic summarization
    * part-of-speech tagging
    * disambiguation
    * chunking
    * natural language understanding and recognition

**First, we need to install NLTK (the Natural Language Toolkit), a library for building Python programs that work with human language data.**

### Terminology

* **Tokenization:**
    * the first step in NLP
    * it is the process of breaking strings into tokens which in turn are small structures or units.
    * involves three steps:
        * 1) breaking a complex sentence into words
        * 2) understanding the importance of each word with respect to the sentence
        * 3) produce structural description on an input sentence
* Example input:

```
# Import the necessary libraries
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the tokenizer models; download them once if missing
# (the resource is named 'punkt' or 'punkt_tab' depending on the NLTK version)
nltk.download('punkt')

# Sample text for performing tokenization
text = ("In Brazil they drive on the right-hand side of the road. "
        "Brazil has a large coastline on the eastern side of South America")

# Pass the string into word_tokenize to break it into tokens
token = word_tokenize(text)
token
```
* Output:

```
['In','Brazil','they','drive', 'on','the', 'right-hand', 'side', 'of', 'the', 'road', '.', 'Brazil', 'has', 'a', 'large', 'coastline', 'on', 'the', 'eastern', 'side', 'of', 'South', 'America']
```

#### Finding frequency of distinct tokens in the text
* **2** methods:

* Example input, **method 1**:

```
# finding the frequency distinct in the tokens
# Importing FreqDist library from nltk and passing token into FreqDist
from nltk.probability import FreqDist
fdist = FreqDist(token)
fdist
```
* Output: `FreqDist({'the': 3, 'Brazil': 2, 'on': 2, 'side': 2, 'of': 2, 'In': 1, 'they': 1, 'drive': 1, 'right-hand': 1, 'road': 1, ...})`

* Example input, **method 2**:

```
# To find the frequency of top 10 words
fdist1 = fdist.most_common(10)
fdist1
```
* Output:

```
[('the', 3),
 ('Brazil', 2),
 ('on', 2),
 ('side', 2),
 ('of', 2),
 ('In', 1),
 ('they', 1),
 ('drive', 1),
 ('right-hand', 1),
 ('road', 1)]
 ```
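For readers without NLTK installed, the same frequency count can be approximated with the standard library. The regular expression below is only a rough stand-in for `word_tokenize` (an assumption for illustration), and `collections.Counter` plays the role of `FreqDist`.

```python
import re
from collections import Counter

text = ("In Brazil they drive on the right-hand side of the road. "
        "Brazil has a large coastline on the eastern side of South America")

# Rough tokenizer: words (keeping internal hyphens) or single punctuation marks.
tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

# Count how often each distinct token occurs.
freq = Counter(tokens)
print(freq.most_common(5))
```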
 

#### Stemming
* Stemming usually refers to normalizing words into their base or root form.
* For example: **waiting**, **waited**, **waits** $\Rightarrow$ **wait**
* There are two methods in stemming, namely, **1) Porter Stemming** (removes common morphological and inflectional endings from words) and **2) Lancaster Stemming** (a more aggressive stemming algorithm).

* **method 1:**

```
# Import PorterStemmer from the nltk library
# and check the stem of the word 'waiting'
from nltk.stem import PorterStemmer
pst = PorterStemmer()
pst.stem("waiting")
```
* output: `wait`

* **method 2:**

```
# Checking for the list of words
stm = ["waited", "waiting", "waits"]
for word in stm :
   print(word+ ":" +pst.stem(word))
```
* output:

```
waited:wait
waiting:wait
waits:wait
```
* **method 3:**

```
# Importing LancasterStemmer from nltk
from nltk.stem import LancasterStemmer
lst = LancasterStemmer()
stm = ["giving", "given", "given", "gave"]
for word in stm:
    print(word + ":" + lst.stem(word))
``` 
* output:

```
giving:giv
given:giv
given:giv
gave:gav
```

#### Lemmatization
* groups together the different inflected forms of a word under a common base form, called the lemma
* Somewhat similar to stemming, as it maps several words onto one common root
* The output of lemmatization is a proper word
* For example, a lemmatizer should map 'gone', 'going', and 'went' to 'go'

* **Lemmatization**, in simpler terms, is the process of converting a word to its base form. 
* The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.
* For example, lemmatization would correctly reduce 'caring' to 'care', whereas a naive stemmer that simply cuts off the 'ing' would produce 'car'
* Lemmatization can be implemented in python by using Wordnet Lemmatizer, Spacy Lemmatizer, TextBlob, Stanford CoreNLP

* **method:**

```
# Importing the lemmatizer from nltk
# (the WordNet data must be available, e.g. via nltk.download('wordnet'))
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
```
* output:

```
rocks : rock
corpora : corpus
```

#### Stop words
* “Stop words” are the most common words in a language, like “the”, “a”, “at”, “for”, “above”, “on”, “is”, “all”. These words carry little meaning on their own and are usually removed from texts. We can remove these stop words using the nltk library

* **method:**

```
# Importing stopwords from the nltk library
from nltk import word_tokenize
from nltk.corpus import stopwords
a = set(stopwords.words('english'))
text = "Cristiano Ronaldo was born on February 5, 1985, in Funchal, Madeira, Portugal."
text1 = word_tokenize(text.lower())
print(text1)
# Keep only the tokens that are not stop words
# (named 'filtered' so the imported stopwords module is not overwritten)
filtered = [x for x in text1 if x not in a]
print(filtered)
```
* output:

```
['cristiano', 'ronaldo', 'was', 'born', 'on', 'february', '5', ',', '1985', ',', 'in', 'funchal', ',', 'madeira', ',', 'portugal', '.']
Output after removing stop words:
['cristiano', 'ronaldo', 'born', 'february', '5', ',', '1985', ',', 'funchal', ',', 'madeira', ',', 'portugal', '.']
```

#### Part of speech tagging (POS)
* Part-of-speech tagging is used to assign parts of speech to each word of a given text (such as nouns, verbs, pronouns, adverbs, conjunction, adjectives, interjection)
* There are many tools available for POS tagging; some of the widely used taggers are NLTK, Spacy, TextBlob, Stanford CoreNLP, etc.
* **method:**

```
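text = "vote to choose a particular man or a group (party) to represent them in parliament"
# Tokenize the text
tex = word_tokenize(text)
# Each token is tagged on its own here, so sentence context is ignored and verbs
# such as 'choose' come out as 'NN'; nltk.pos_tag(tex) tags the whole sentence at once
for token in tex:
    print(nltk.pos_tag([token]))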
```
* output:

```
[('vote', 'NN')]
[('to', 'TO')]
[('choose', 'NN')]
[('a', 'DT')]
[('particular', 'JJ')]
[('man', 'NN')]
[('or', 'CC')]
[('a', 'DT')]
[('group', 'NN')]
[('(', '(')]
[('party', 'NN')]
[(')', ')')]
[('to', 'TO')]
[('represent', 'NN')]
[('them', 'PRP')]
[('in', 'IN')]
[('parliament', 'NN')]
```

#### Named Entity Recognition
* is the process of detecting the named entities such as the person name, the location name, the company name, the quantities and the monetary value.
* **method:**

```
text = "Google's CEO Sundar Pichai introduced the new Pixel at Minnesota Roi Centre Event"
#importing chunk library from nltk
from nltk import ne_chunk
# tokenize and POS Tagging before doing chunk
token = word_tokenize(text)
tags = nltk.pos_tag(token)
chunk = ne_chunk(tags)
chunk
```
* output:

```
Tree('S', [Tree('GPE', [('Google', 'NNP')]), ("'s", 'POS'), Tree('ORGANIZATION', [('CEO', 'NNP'), ('Sundar', 'NNP'), ('Pichai', 'NNP')]), ('introduced', 'VBD'), ('the', 'DT'), ('new', 'JJ'), ('Pixel', 'NNP'), ('at', 'IN'), Tree('ORGANIZATION', [('Minnesota', 'NNP'), ('Roi', 'NNP'), ('Centre', 'NNP')]), ('Event', 'NNP')])
```

#### Chunking
* Chunking means picking up individual pieces of information and grouping them into bigger pieces. In the context of NLP and text mining, chunking means a grouping of words or tokens into chunks.
* **method:**

```
text = "We saw the yellow dog"
token = word_tokenize(text)
tags = nltk.pos_tag(token)
# Noun phrase: optional determiner, any number of adjectives, then a noun
reg = "NP: {<DT>?<JJ>*<NN>}"
a = nltk.RegexpParser(reg)
result = a.parse(tags)
print(result)
```
* output:
`(S We/PRP saw/VBD (NP the/DT yellow/JJ dog/NN))`


#### Encoding Text
* When you are faced with text data, it will often **not** be tabular
* Data that is not in a predefined form is called **unstructured data**
    * One example of unstructured data is: **free text**
* Before you can leverage text data in an ML model, you must first transform it into a series of columns of numbers or vectors 
* There are many different approaches to achieving this (above): we'll go through some of the most common
* Using Inaugural Speeches dataset:
    * before any text analytics can be performed, you must ensure that the text data is in a format that can be used
    * the body of the text of these speeches contained in 'text' column as single observation per speech
    
* **1)** Most bodies of text will have **non-letter characters**, such as punctuation, that **will need to be removed before analysis.**
    * Use: `.str.replace()`
    * regex:
        * `[a-zA-Z]` : All letter characters
        * `[^a-zA-Z]` : All non-letter characters
    * `speech_df['text'] = speech_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)`
 
* **2)** Once you have removed all unwanted characters, you will then want to **standardize the remaining characters in your text so that they are all lowercase.**
    * `speech_df['text'] = speech_df['text'].str.lower()`
        
* Often there is value in the fundamental characteristics of a text
    * **Length of a text**
        * `speech_df['char_cnt'] = speech_df['text'].str.len()`
        * calculate the number of characters in each speech
    * **Word counts**
        * `speech_df['word_cnt'] = speech_df['text'].str.split().str.len()`
    * **Average length of word**
        * `speech_df['avg_word_len'] = speech_df['char_cnt'] / speech_df['word_cnt']`
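The cleaning and basic text features above can be chained on a tiny frame. The two sample strings are invented, and the cleaned text is stored in a separate `text_clean` column here so the original stays available.

```python
import pandas as pd

# Hypothetical two-row stand-in for the speeches dataset.
speech_df = pd.DataFrame({'text': ['Fellow-Citizens of the Senate!',
                                   'We the People, 1789.']})

# Replace non-letters with spaces, then lowercase everything.
speech_df['text_clean'] = (speech_df['text']
                           .str.replace('[^a-zA-Z]', ' ', regex=True)
                           .str.lower())

# Simple descriptive features of each text.
speech_df['char_cnt'] = speech_df['text_clean'].str.len()
speech_df['word_cnt'] = speech_df['text_clean'].str.split().str.len()

# char_cnt includes spaces, so this is only an approximate average word length.
speech_df['avg_word_len'] = speech_df['char_cnt'] / speech_df['word_cnt']

print(speech_df[['char_cnt', 'word_cnt', 'avg_word_len']])
```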

#### Word Count Representation
* The most common approach to creating features from transformed & cleaned text data, is to create a column for each word and record the number of times each particular word appears in each text.
* this results in a set of columns equal in width to the number of unique words in the dataset with counts filling each entry.
* sklearn already has a built-in function for this with: `CountVectorizer()`

```
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
```
* creating a column for every word will result in far too many values for analysis. Thankfully, you can specify arguments when initializing your CountVectorizer to limit this.
* for example, you can specify the minimum number of texts that a word must be contained in: `min_df`
    * if a float is given (for example `min_df = 0.1`) then the word must appear in at least this proportion of documents (here 10%)
    * this threshold eliminates words that occur so rarely that they would not be useful when generalizing to new texts
    * conversely, `max_df` limits words to only ones that occur below a certain percentage of the data
    * this can be useful to remove words that occur too frequently to be of any value 

```
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(min_df = 0.1, max_df = 0.9)
cv.fit(speech_df['text_clean'])
cv_transformed = cv.transform(speech_df['text_clean'])
```
* The above code outputs a *sparse array* with a row for each text and a column for every word
* to transform this to a *non-sparse array*:
`cv_transformed.toarray()`
* output will be an array with no concept of column names
* to get the names of the features (words) that have been generated, call:
`feature_names = cv.get_feature_names_out()`
* returns an array of the features (words) generated, in the same order as the columns of the converted array (older scikit-learn releases exposed this as `get_feature_names()`)

* **Note:** While fitting and transforming separately can be useful, particularly when you need to transform a different dataset than the one that you used to fit the vectorizer, you can accomplish both steps at once using the `.fit_transform()` method
* **Putting it all together:**

`cv_df = pd.DataFrame(cv_transformed.toarray(), columns=cv.get_feature_names_out()).add_prefix('Counts_')`

* you can now combine this dataframe with your original dataframe so that they can be used to generate future analytical models 

`speech_df = pd.concat([speech_df, cv_df], axis=1, sort=False)`


#### Tf-Idf (Term Frequency- Inverse Document Frequency)
* While counts of occurrences of words can be a good first step towards encoding your text to build models, they have some limitations
* The main issue is that counts will be much higher for very common words that occur across all texts, providing very little value as a distinguishing feature.
* to limit common words like "the" from overpowering your model, some form of normalization can be used 
* one of the most effective approaches to do this is called: **Term Frequency- Inverse Document Frequency**, or **TF-IDF**

\begin{equation}
\text{TF-IDF} =
\frac{\text{count of word occurrences}}{\text{total words in document}} \times \log\left(\frac{\text{total number of docs}}{\text{number of docs word is in}}\right)
\end{equation}
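A small worked example may help, using the textbook form of TF-IDF: term frequency multiplied by the log of (total documents / documents containing the word). The documents and the helper `tf_idf` are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from this sketch.

```python
import math

# Hypothetical tokenized documents.
docs = [['the', 'union', 'of', 'the', 'states'],
        ['the', 'people', 'of', 'the', 'union'],
        ['the', 'constitution']]

def tf_idf(word, doc, docs):
    """Textbook TF-IDF of `word` in `doc`, relative to the corpus `docs`."""
    tf = doc.count(word) / len(doc)                   # share of the document
    n_containing = sum(1 for d in docs if word in d)  # documents with the word
    idf = math.log(len(docs) / n_containing)          # rarer words weigh more
    return tf * idf

print(tf_idf('the', docs[0], docs))     # appears in every document
print(tf_idf('states', docs[0], docs))  # appears in one document only
```

A word found in every document gets an IDF of log(1) = 0, while a word unique to one document keeps a positive weight.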

* the effect: reduces the value of common words, while increasing the weights of words that do not occur in many documents 
* to use the TF-IDF Vectorizer

```
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer()
```
* similarly to the CountVectorizer, you can limit the number of features created by specifying arguments when initializing the TfidfVectorizer
* setting the `max_features` argument to 100 keeps only the 100 most frequent words
* `stop_words` removes a predefined list of the most common words of a language, such as "and" and "the"
    * you can use sklearn's built-in list, load your own, or use lists provided by other Python libraries

```
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(max_features=100, stop_words='english')
tv.fit(train_speech_df['text'])
train_tv_transformed = tv.transform(train_speech_df['text'])
```
* **Note** that here we are fitting and transforming the training data as a subset of the original data

```
train_tv_df = pd.DataFrame(train_tv_transformed.toarray(), columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
train_speech_df = pd.concat([train_speech_df, train_tv_df], axis = 1, sort = False)
```
* Inspect your transformation:

```
examine_row = train_tv_df.iloc[0]
print(examine_row.sort_values(ascending=False))
```
* **Applying a vectorizer to new data (like the test set):**
* **Preprocess** your test data using the transformations made on the train data **only**.

```
# Transform only: the vectorizer was already fitted on the training data
test_tv_transformed = tv.transform(test_speech_df['text_clean'])

test_tv_df = pd.DataFrame(test_tv_transformed.toarray(), columns=tv.get_feature_names_out()).add_prefix('TFIDF_')

test_speech_df = pd.concat([test_speech_df, test_tv_df], axis = 1, sort=False)
```

#### Bag of words and N-grams
* So far we've looked at individual words on their own, without any context
* This approach is called a **bag of words model**, as the words are being treated as if they are being drawn from a bag at random, with no concept of order or grammar
* Individual words *can* lose all their context or meaning when viewed independently 
* Issues with bag of words:
    * Positive meaning: happy
    * Negative meaning: not happy
    * Positive meaning: never not happy
* $\Uparrow$ different meanings of happy depending on context 
* One common method to retain at least some concept of word order in a text is to instead use multiple consecutive words, like pairs (**bi-grams**), or three consecutive words (**tri-grams**).
* This maintains at least some ordering information, while at the same time allowing for the creation of a reasonable set of features
* To leverage N-grams in your own models:
* the `ngram_range` parameter takes a tuple with the minimum and maximum length of the n-grams to be included

```
tv_bi_gram_vec = TfidfVectorizer(ngram_range =(2,2))

tv_bi_gram = tv_bi_gram_vec.fit_transform(speech_df['text'])
```
* Create a DataFrame with the Counts features 

```
tv_df = pd.DataFrame(tv_bi_gram.toarray(), columns=tv_bi_gram_vec.get_feature_names_out()).add_prefix('Counts_')
tv_sums = tv_df.sum()
```


#### Image Processing and scikit-image
* image pre-processing has become a highly valuable skill, applicable in many use cases.
* learn to process, transform, and manipulate images at your will, even when they come in thousands. 
* restore damaged images, perform noise reduction, smart-resize images, count the number of dots on a dice, apply facial detection, and much more, using scikit-image. 
* Extract data, transform and analyze images using NumPy and Scikit-image
* With just a few lines of code, you will convert RGB images to grayscale, get data from them, obtain histograms containing very useful information, and separate objects from the background

#### Scikit-image
* **Image processing** is a method to perform operations on images and videos in order to:
    * Enhance them
    * Extract useful information
    * Analyze them and make decisions
    
* By quantifying the information in images, we can make calculations
* Image processing is a subset of computer vision
* **Wide range of applications:**
    * Medical image analysis
    * Artificial intelligence
    * Image restoration and enhancement 
    * Geospatial computing
    * Surveillance
    * Robotic vision
    * Automotive safety
    * And many more...
* **Purposes** of image processing:
    * 1) **Visualization:**
        * Observe objects that are not visible
    * 2) **Image sharpening and restoration:**
        * Create a better image
    * 3) **Image retrieval:**
        * Seek an image of interest
    * 4) **Measurement of pattern:**
        * Measures various objects
    * 5) **Image recognition:**
        * Distinguish objects in an image.

* **scikit-image** is an image-processing library in Python:
    * easy to use
    * makes use of ML
    * out of the box algorithms
* What is an image?
    * A digital image is an array or matrix of square pixels (picture elements) arranged in columns and rows: in other words, a 2-dimensional matrix.
    * These pixels contain information about color and intensity. 
* There are some testing-purpose images provided by scikit-image, in a module called data. 
* 2-dimensional color images are often represented in RGB—3 layers of 2-dimensional arrays, where the three layers represent Red, Green and Blue channels of the image.
* Grayscale images only have shades of black and white. Often, the grayscale intensity is stored as an 8-bit integer giving 256 possible different shades of gray. Grayscale images don't have any color information.
* RGB images have three color channels, while grayscaled ones have a single channel.

* Convert RGB to Grayscale or Grayscale to RGB with:

```
from skimage import color
grayscale = color.rgb2gray(original)
rgb = color.gray2rgb(grayscale)
```
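As a rough picture of what the RGB-to-grayscale conversion does, the sketch below collapses the three channels into one with a weighted sum. The weights are assumed to match the luminance coefficients listed in the scikit-image documentation for `rgb2gray`; the helper `to_grayscale` and the tiny image are hypothetical.

```python
import numpy as np

# Luminance weights for the R, G and B channels (assumption: the
# coefficients documented for skimage.color.rgb2gray)
WEIGHTS = np.array([0.2125, 0.7154, 0.0721])


def to_grayscale(rgb):
    """Collapse a (height, width, 3) float image into (height, width)."""
    return rgb @ WEIGHTS


# Made-up 1 x 3 image: one pure red, one pure green and one white pixel
rgb = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]])
gray = to_grayscale(rgb)
print(gray.shape)
print(gray)
```

The result has a single channel, and green contributes most to perceived brightness, which is why the pure green pixel ends up lighter than the pure red one.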


To display images using matplotlib:

```
import matplotlib.pyplot as plt

def show_image(image, title='Image', cmap_type='gray'):
    plt.imshow(image, cmap=cmap_type)
    plt.title(title)
    plt.axis('off')
    plt.show()
```

#### Numpy for images
* With NumPy, we can practice simple image processing techniques, such as flipping images, extracting features, and analyzing them.
* Because images can be represented by NumPy multi-dimensional arrays (`ndarray` objects), NumPy methods for manipulating arrays work well on these images.
* Remember that a color image is a NumPy array with a third dimension for color channels. We can slice the multidimensional array and obtain these channels separately.
* To see the individual color intensities along the image:

```
# Obtain the ____ [red, blue, or green] values of the image 
# Keep the values of the height and width of the pixels, select only the desired layer of color:
red = image[:, :, 0]
green = image[:, :, 1]
blue = image[:, :, 2]
```
* Because images are NumPy arrays, we can get their shape. This Madrid picture is 426 pixels high and 640 pixels wide.
* It has three layers for color representation: it's an RGB image. So it has a shape of (426, 640, 3): `madrid_image.shape`. Its `madrid_image.size` is 817920, which counts every value in the array (426 × 640 × 3), not the number of pixels (426 × 640 = 272640).
* We can flip the image vertically by using the `np.flipud()` function.
`seville_vertical_flip = np.flipud(flipped_seville)`
* You can flip the image horizontally using the `np.fliplr()` function.
`seville_horizontal_flip = np.fliplr(seville_vertical_flip)`
* Use the `show_image()` function to display an image.
`show_image(seville_horizontal_flip, 'Seville')`
* The histogram of an image is a graphical representation of the number of pixels at each intensity value, from 0 (pure black) to 255 (pure white).
* We can also create histograms from RGB-3 colored images. In this case each channel: red, green and blue will have a corresponding histogram.
* We can learn a lot about an image by just looking at its histogram. Histograms are used to **threshold images**, to **alter brightness and contrast**, and to **equalize** an image
* Matplotlib has a histogram method. It takes an input array (frequency) and bins as parameters. The successive elements in bin array act as the boundary of each bin. We obtain the red color channel of the image by slicing it:
`red = image[:, :, 0]` 
* We then use the histogram function:
`plt.hist(red.ravel(), bins=256)`
* Use `.ravel()` to return a contiguous flattened (1-D) array of the color values of the image, in this case red, and pass it to `plt.hist()` together with the bins. We set bins to 256 because we want the number of pixels for every pixel value from 0 to 255, so the histogram needs 256 bins.
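The array operations in this section can be tried without loading a photo. The sketch below uses a small random array as a stand-in for an image (the 4 x 6 size and the random values are assumptions for illustration) and computes the per-intensity counts with `np.histogram`, which gives the numbers a 256-bin histogram plot would draw.

```python
import numpy as np

# Synthetic stand-in for a photo: 4 pixels high, 6 wide, 3 color channels
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

height, width, channels = image.shape
n_pixels = height * width        # 24 pixels
n_values = image.size            # 72 values = pixels x channels

red = image[:, :, 0]             # one 2-D layer per color channel

flipped_ud = np.flipud(image)    # reverses the row order (top <-> bottom)
flipped_lr = np.fliplr(image)    # reverses the column order (left <-> right)

# One bin per intensity value 0..255 for the red channel
counts, bin_edges = np.histogram(red.ravel(), bins=256, range=(0, 256))
print(n_pixels, n_values, counts.sum())
```

Every pixel lands in exactly one bin, so the counts add up to the number of pixels in the channel.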

#### Getting started with thresholding
* Thresholding is used to partition the background and foreground of grayscale images, by essentially making them black and white. 
* We compare each pixel to a given threshold value. 
* If the pixel is greater than that value, we turn it white (255). Otherwise, we turn it black (0).
* **Thresholding** is the simplest method of image segmentation.
* Thresholding lets us isolate elements and is used in: object detection, facial recognition, etc
    * works best in high contrast grayscale images
* To threshold color images, we must first convert them to grayscale:

```
# Set a threshold value manually (assumes 8-bit intensities, 0-255):
thresh = 127 # midpoint between 0 and 255
# Apply thresholding to the image
binary = image > thresh
# Show original and thresholded
show_image(image, 'Original')
show_image(binary, 'Thresholded')
```
* **Inverted thresholding = inverting the color**

```
# Set a threshold value manually (assumes 8-bit intensities, 0-255):
thresh = 127 # midpoint between 0 and 255
# Apply thresholding to the image
inverted_binary = image <= thresh
# Show original and thresholded
show_image(image, 'Original')
show_image(inverted_binary, 'Inverted thresholded')
```
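A tiny array makes the two comparisons easy to inspect by eye. The 3 x 3 patch below is made up for illustration; the comparisons are the same ones used in the two blocks above.

```python
import numpy as np

# Made-up 3 x 3 grayscale patch with 8-bit intensities
patch = np.array([[ 10, 200, 127],
                  [128,  50, 255],
                  [  0, 130,  90]], dtype=np.uint8)

thresh = 127
binary = patch > thresh             # True (shown white) where brighter than thresh
inverted_binary = patch <= thresh   # True where the pixel is at or below thresh

print(binary.astype(int))
print(inverted_binary.astype(int))
```

The inverted mask is the exact complement of the regular one, so every pixel is white in exactly one of the two results.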
* There are 2 categories of thresholding in scikit-image:
    * **Global or Histogram-based:** good for uniform backgrounds
    * **Local or adaptive:** for uneven background illumination (background is not easily differentiated)
        * Note that local is slower than global thresholding.
* scikit-image includes a function that evaluates several global algorithms, so that you can choose the one that gives you best results: the `try_all_threshold()` function from `filters` module.
    
```
from skimage.filters import try_all_threshold
import matplotlib.pyplot as plt
# Obtain all the resulting images
# verbose=False so it doesn't print the function name for each method
fig, ax = try_all_threshold(image, verbose=False)
# Show resulting plots
plt.show()
```
* It will use seven global algorithms
* When the background of an image seems uniform, global thresholding works best. Previously, we arbitrarily set the thresh value, but we can also calculate the optimal value. 
* For that we import the `threshold_otsu()` function from the `filters` module. Then obtain the optimal global thresh value by calling this function. Apply the global thresh to the image.

```
# Import the otsu threshold function
from skimage.filters import threshold_otsu
# Obtain the optimal threshold value
thresh = threshold_otsu(image)
# Apply thresholding to the image 
binary_global = image > thresh
```
* When plotted on the histogram of the image, the optimal thresh appears as a vertical line separating the background intensities from the foreground intensities.
* **Local threshold:**
* If the image doesn't have high contrast or the background is uneven, local thresholding produces better results. 
* Import `threshold_local()`, also from filters. 
* With this function, we calculate thresholds in small pixel regions surrounding each pixel we are binarizing. So we need to specify a `block_size` (an odd number of pixels) that defines the region around each pixel, also known as its local neighborhood.
* And an optional `offset`: a constant subtracted from the mean of each block to calculate the local threshold value.
* Here in the threshold_local function we set a block_size of 35 pixels and an offset of 10. Then apply that local thresh.

```
#Import the local threshold function
from skimage.filters import threshold_local 
#Set the block size to 35
block_size = 35
#Obtain the optimal local thresholding
local_thresh = threshold_local(text_image, block_size, offset = 10)
#Apply local thresholding and obtain the binary image
binary_local = text_image > local_thresh
```
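The idea behind local thresholding can be sketched in plain NumPy. This is a simplified illustration, not the library routine: it uses an unweighted block mean, the helper name `local_mean_threshold` is hypothetical, and the test image (a left-to-right brightness ramp with one darker stroke) is made up.

```python
import numpy as np


def local_mean_threshold(image, block_size, offset=0.0):
    """Per-pixel threshold: mean of the surrounding block minus offset."""
    pad = block_size // 2
    # Repeat edge values so border pixels also get a full neighborhood
    padded = np.pad(image.astype(float), pad, mode="edge")
    thresholds = np.empty(image.shape, dtype=float)
    for row in range(image.shape[0]):
        for col in range(image.shape[1]):
            block = padded[row:row + block_size, col:col + block_size]
            thresholds[row, col] = block.mean() - offset
    return thresholds


# Made-up image: illumination brightens left to right, one darker "ink" stroke
image = np.tile(np.linspace(50, 200, 9), (9, 1))
image[4, 2:7] -= 40

local_thresh = local_mean_threshold(image, block_size=3, offset=10)
binary_local = image > local_thresh
print(binary_local.astype(int))
```

In this example the right end of the stroke is brighter than the background on the left, so no single global value separates them, while comparing each pixel with its own neighborhood does.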

```
# Import the otsu threshold function and the grayscale converter
from skimage.filters import threshold_otsu
from skimage.color import rgb2gray

# Make the image grayscale using rgb2gray
chess_pieces_image_gray = rgb2gray(chess_pieces_image)

# Obtain the optimal threshold value with otsu
thresh = threshold_otsu(chess_pieces_image_gray)

# Apply thresholding to the image
binary = chess_pieces_image_gray > thresh

# Show the image
show_image(binary, 'Binary image')
```
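For intuition about how an "optimal" global value can be computed, here is a minimal NumPy sketch of Otsu's criterion: try every 8-bit level and keep the one that maximizes the between-class variance. The helper `otsu_threshold` and the six-pixel image are hypothetical, and the result is not guaranteed to equal the number returned by `threshold_otsu()`, which works on histogram bin centers.

```python
import numpy as np


def otsu_threshold(gray):
    """Return the 8-bit level t that maximizes the between-class variance
    when pixels are split into gray <= t and gray > t."""
    counts = np.bincount(gray.ravel(), minlength=256).astype(float)
    probs = counts / counts.sum()
    levels = np.arange(256)

    weight_bg = np.cumsum(probs)             # share of pixels at or below t
    weight_fg = 1.0 - weight_bg              # share of pixels above t
    cum_mean = np.cumsum(probs * levels)
    total_mean = cum_mean[-1]

    # Splits that leave one class empty keep a variance of 0
    denom = weight_bg * weight_fg
    between_var = np.zeros(256)
    valid = denom > 0
    between_var[valid] = (
        (total_mean * weight_bg[valid] - cum_mean[valid]) ** 2 / denom[valid]
    )
    return int(np.argmax(between_var))


# Made-up image with a dark row and a bright row
gray = np.array([[20, 30, 25],
                 [200, 210, 220]], dtype=np.uint8)
thresh = otsu_threshold(gray)
binary = gray > thresh
print(thresh)
print(binary)
```

Any level between the dark and bright groups separates them equally well here, so the sketch simply reports the first such level.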

### HOG: Histogram of Oriented Gradients
* What is a **feature descriptor**?
    * It is a simplified representation of the image that contains only the most important information about the image.
* There are a number of feature descriptors out there:
    * **HOG: Histogram of Oriented Gradients**
    * **SIFT: Scale Invariant Feature Transform**
    * **SURF: Speeded-Up Robust Feature**
    
* **HOG**, or **Histogram of Oriented Gradients**, is a feature descriptor that is often used to extract features from image data. It is widely used in computer vision tasks for object detection. Some important aspects of HOG that make it different from other feature descriptors:
    * The HOG descriptor focuses on the structure or the shape of an object. Now you might ask, how is this different from the edge features we extract for images? In the case of edge features, we only identify if the pixel is an edge or not. HOG is able to provide the edge direction as well. This is done by extracting the gradient and orientation (or you can say magnitude and direction) of the edges
    * Additionally, these orientations are calculated in ‘localized’ portions. This means that the complete image is broken down into smaller regions and for each region, the gradients and orientation are calculated.
    * Finally the HOG would generate a Histogram for each of these regions separately. The histograms are created using the gradients and orientations of the pixel values, hence the name ‘Histogram of Oriented Gradients’
* To put a formal definition to this: The HOG feature descriptor counts the occurrences of gradient orientation in localized portions of an image.

* **Process of Calculating the Histogram of Oriented Gradients (HOG):**
    * **Step 1: Preprocess the Data:**
        * We need to preprocess the image and bring the width-to-height ratio to 1:2. The image size should preferably be 64 x 128.
        * This is because we will be dividing the image into 8×8 and 16×16 patches to extract the features.
        * Having the specified size (64 x 128) will make all our calculations pretty simple.
    * **Step 2: Calculating Gradients (direction x and y):**
        * The next step is to calculate the gradient for every pixel in the image. 
        * Gradients are the small changes in pixel intensity in the x and y directions.
        * Take a small patch from the image and calculate the gradients on it.
        * Get the pixel values for this patch.
        * Generate a pixel matrix for the given patch.
        * to determine the gradient (or change) in the x-direction, we need to subtract the value on the left from the pixel value on the right.
        * Similarly, to calculate the gradient in the y-direction, we will subtract the pixel value below from the pixel value above the selected pixel.
        * This process will give us two new matrices – one storing gradients in the x-direction and the other storing gradients in the y direction. This is similar to using a Sobel Kernel of size 1. The magnitude would be higher when there is a sharp change in intensity, such as around the edges.
        * We have calculated the gradients in both x and y direction separately. The same process is repeated for all the pixels in the image. 
    * **Step 3: Calculate the Magnitude and Orientation:**
        * Using the gradients we calculated in the last step, we will now determine the magnitude and direction for each pixel value. The magnitude follows from the Pythagorean theorem, `sqrt(Gx² + Gy²)`, and the orientation is the angle `arctan(Gy / Gx)`.
        * For the example pixel, the orientation comes out to be 36 degrees when we plug in the values. So now, for every pixel value, we have the total gradient (magnitude) and the orientation (direction). We need to generate the histogram using these gradients and orientations.
    * **Step 4: Calculate Histogram of Gradients in 8×8 cells (9×1):**
        * The histograms created in the HOG feature descriptor are not generated for the whole image. Instead, the image is divided into 8×8 cells, and the histogram of oriented gradients is computed for each cell.
        * By doing so, we get the features (or histogram) for the smaller patches which in turn represent the whole image. We can certainly change this value here from 8 x 8 to 16 x 16 or 32 x 32.
        * Once we have generated the HOG for the 8×8 patches in the image, the next step is to normalize the histogram.
    * **Step 5: Normalize gradients in 16×16 blocks (36×1):**
        * Although we already have the HOG features created for the 8×8 cells of the image, the gradients of the image are sensitive to the overall lighting. This means that for a particular picture, some portions of the image would be very bright compared to the other portions.
        * We cannot completely eliminate this from the image. But we can reduce this lighting variation by normalizing the gradients over 16×16 blocks. Each block combines four 8×8 cells, so concatenating their 9×1 histograms gives a 36×1 vector.
        * To normalize this vector, we divide each of its values by the square root of the sum of squares of the values.
    * **Step 6: Features for the complete image:**
```
#importing required libraries
from skimage.io import imread, imshow
from skimage.transform import resize
from skimage.feature import hog
from skimage import exposure
import matplotlib.pyplot as plt
%matplotlib inline

#reading the image
img = imread('puppy.jpeg')
imshow(img)
print(img.shape)
```
* We can see that the shape of the image is 663 x 459. We will have to resize this image into 64 x 128. Note that we are using skimage which takes the input as height x width.

```
#resizing image 
resized_img = resize(img, (128,64)) 
imshow(resized_img) 
print(resized_img.shape)

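#creating hog features
# channel_axis=-1 marks the last axis as color (scikit-image >= 0.19; older releases used multichannel=True)
fd, hog_image = hog(resized_img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2), visualize=True, channel_axis=-1)
print(fd.shape)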
```
* A basic idea of what each of these hyperparameters represents:
    * `orientations` is the number of buckets (bins) in each histogram. Since we want a 9 x 1 matrix per cell, we set `orientations` to 9.
    * `pixels_per_cell` defines the size of the cell for which we create the histograms. The steps above used 8 x 8 cells, so we set the same value here. As mentioned previously, you can choose to change this value.
    * `cells_per_block` is the size of the block over which we normalize the histogram. It is given in cells per block, not in pixels. So, instead of writing 16 x 16, we use 2 x 2 here.
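The block normalization controlled by `cells_per_block` can be sketched as follows. This is a plain L2 normalization under stated assumptions: the helper `normalize_block`, the small `eps` guard against division by zero, and the random histograms are all made up for illustration, and the library's default normalization scheme may differ in detail.

```python
import numpy as np


def normalize_block(cell_histograms, eps=1e-5):
    """Concatenate the cell histograms of one block and L2-normalize them."""
    block_vector = np.concatenate(cell_histograms)
    return block_vector / np.sqrt(np.sum(block_vector ** 2) + eps ** 2)


# Four made-up 9-bin histograms standing in for the 2 x 2 cells of one block
rng = np.random.default_rng(42)
cells = [rng.uniform(0, 100, size=9) for _ in range(4)]

block_features = normalize_block(cells)
# Same block with every gradient doubled, as if the lighting were stronger
brighter = normalize_block([2 * c for c in cells])
print(block_features.shape)
```

Doubling every gradient leaves the normalized vector essentially unchanged, which is the reason for normalizing: the features depend on the pattern of gradients rather than on overall brightness.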

* As expected, we have 3,780 features for the image: 7 x 15 block positions, each contributing 36 values. You can choose to change the values of the hyperparameters and that will give you a feature matrix of different sizes.
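The figure of 3,780 can be reproduced by counting block positions. The helper below is a hypothetical sketch that assumes blocks slide across the image one cell at a time and that the image dimensions are given as height x width.

```python
def hog_feature_count(image_shape, orientations=9,
                      pixels_per_cell=(8, 8), cells_per_block=(2, 2)):
    """Length of the HOG vector when blocks slide one cell at a time."""
    height, width = image_shape
    cells_down = height // pixels_per_cell[0]
    cells_across = width // pixels_per_cell[1]
    blocks_down = cells_down - cells_per_block[0] + 1
    blocks_across = cells_across - cells_per_block[1] + 1
    values_per_block = cells_per_block[0] * cells_per_block[1] * orientations
    return blocks_down * blocks_across * values_per_block


# 128 x 64 image: 16 x 8 cells, 15 x 7 block positions, 36 values per block
n_features = hog_feature_count((128, 64))
print(n_features)
```

Calling the helper with other cell or block sizes shows how quickly the feature vector shrinks or grows when the hyperparameters change.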

```
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8), sharex=True, sharey=True) 

ax1.imshow(resized_img, cmap=plt.cm.gray) 
ax1.set_title('Input image') 

# Rescale histogram for better display 
hog_image_rescaled = exposure.rescale_intensity(hog_image, in_range=(0, 10)) 

ax2.imshow(hog_image_rescaled, cmap=plt.cm.gray) 
ax2.set_title('Histogram of Oriented Gradients')

plt.show()
```
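Steps 2 and 3 of the process above can be checked by hand for a single pixel. The 3 x 3 patch below is made up, with neighbor values chosen so that the orientation lands near the 36 degrees mentioned earlier; the sign convention (right minus left, above minus below) follows the description in Step 2.

```python
import math

# Made-up 3 x 3 grayscale patch; the center pixel (85) is the one analyzed
patch = [[121, 64, 99],
         [ 78, 85, 89],
         [ 45, 56, 60]]

row, col = 1, 1
gx = patch[row][col + 1] - patch[row][col - 1]   # right minus left
gy = patch[row - 1][col] - patch[row + 1][col]   # above minus below

magnitude = math.hypot(gx, gy)                         # sqrt(gx^2 + gy^2)
orientation = math.degrees(math.atan2(gy, gx)) % 180   # unsigned angle in [0, 180)

print(gx, gy, round(magnitude, 1), round(orientation))
```

Repeating this for every pixel yields the magnitude and orientation matrices that feed the per-cell histograms.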


#### Different Methods to Create Histograms using Gradients and Orientation:
* **Method 1:**
    * Let us start with the simplest way to generate histograms. We will take each pixel value, find the orientation of the pixel and update the frequency table.
    * Here is the process for the highlighted pixel (85). Since the orientation for this pixel is 36, we will add a number against angle value 36, denoting the frequency.
    * The same process is repeated for all the pixel values, and we end up with a frequency table that denotes angles and the occurrence of these angles in the image. This frequency table can be used to generate a histogram with angle values on the x-axis and the frequency on the y-axis.
* **Method 2:**
    * This method is similar to the previous method, except that here we have a bin size of 20 degrees. Since the orientations range from 0 to 180 degrees, the number of buckets we get here is 9 (180 / 20).
* **Method 3:**
    * The above two methods use only the orientation values to generate histograms and do not take the gradient value into account. 
    * Here is another way in which we can generate the histogram – instead of using the frequency, we can use the gradient magnitude to fill the values in the matrix. 
* **Method 4:**
    * Let’s make a small modification to the above method. Here, we will add the contribution of a pixel’s gradient to the bins on either side of the pixel gradient. Remember, the higher contribution should be to the bin value which is closer to the orientation.
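    * For a pixel with gradient magnitude $m$ and orientation $\theta$ that falls between neighboring bin values $b_j$ and $b_{j+1}$, with bin width $w = b_{j+1} - b_j$, the two contributions are:

\begin{equation}
v_j = m \cdot \frac{b_{j+1} - \theta}{w}, \qquad v_{j+1} = m \cdot \frac{\theta - b_j}{w}
\end{equation}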
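Method 4 can be sketched in a few lines of plain Python. The function name `vote` is hypothetical, and the sketch assumes nine bins labeled by their lower edge (0, 20, ..., 160) with orientations measured in degrees in [0, 180); the magnitude and orientation in the example are illustrative values.

```python
def vote(histogram, magnitude, orientation, bin_width=20):
    """Split one pixel's gradient magnitude between its two nearest bins.

    Assumes bins labeled 0, 20, ..., 160 and an orientation in [0, 180).
    """
    n_bins = len(histogram)
    lower = int(orientation // bin_width) % n_bins
    upper = (lower + 1) % n_bins             # orientations past 160 wrap to bin 0
    share_upper = (orientation - lower * bin_width) / bin_width
    histogram[lower] += magnitude * (1 - share_upper)
    histogram[upper] += magnitude * share_upper
    return histogram


# Illustrative pixel: magnitude 13.6 at 36 degrees, between bins 20 and 40
histogram = vote([0.0] * 9, magnitude=13.6, orientation=36)
print(histogram)
```

Because 36 is closer to 40 than to 20, the bin labeled 40 receives the larger share, and the two shares always add up to the full magnitude.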