# Feature Engineering

#### Categorical, Text, and Image Features
* Data scientists regularly work with categorical, text, and image data. However, to execute machine learning algorithms on these data types, it's necessary to perform transformations first. 
* Categorical data, such as the neighborhood in which a property is located, does not always work well with the machine learning algorithm you're most interested in using. 
* Linear regression, for example, requires numerical inputs.
* Options include one-hot encoding of categorical data and text and image data feature engineering (important for processes like NLP, which has applications in social media and data mining).
* Feature engineering with images can be very complex; the simplest approach is to use the pixel values themselves as features
* HOG: Histogram of Oriented Gradients
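
As a minimal sketch of the "pixel values as features" idea, the snippet below flattens a small grayscale image into a feature vector. The 4x4 array is a made-up stand-in for a real image, and the scaling to [0, 1] is an assumption rather than a required step.

```python
import numpy as np

# Hypothetical 4x4 grayscale "image" with pixel intensities between 0 and 240
image = (np.arange(16, dtype=np.uint8) * 16).reshape(4, 4)

# Flatten the 2-D pixel grid into a 1-D feature vector and scale to [0, 1]
pixel_features = image.flatten().astype(np.float64) / 255.0

print(pixel_features.shape)  # (16,)
```

Each pixel becomes one column in the feature matrix, so real images quickly produce very wide inputs, which is part of why descriptors such as HOG are used instead.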
   
Feature Engineering: understand how best to preprocess and engineer features from categorical, continuous, and unstructured data. 

* **Feature Engineering:** the act of taking raw data and extracting features for machine learning 
* Most machine learning algorithms work with tabular data.
* Most ML algorithms require their input data to be represented as a vector or a matrix, and many assume that the data is normally distributed

* **Different Types of Data:**
    * **Continuous:** either integers (whole numbers) or floats (decimal values)
    * **Categorical:** one of a limited set of values, e.g. gender, country of birth
    * **Ordinal:** ranked values, often with no detail of distance between them
    * **Boolean:** True/False values
    * **Datetime:** dates and times
* In pandas, columns with the `object` dtype typically contain strings
* Knowing the type of each column is useful when your analysis applies only to a subset of data types. To select columns by type, use the `.select_dtypes()` method and pass a list of relevant data types: `only_ints = df.select_dtypes(include=['int'])`
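
A small sketch of selecting columns by type with `.select_dtypes()` follows. The DataFrame and its column names are invented for illustration; the `'number'` selector covers both integer and float columns.

```python
import pandas as pd

# Hypothetical data with one integer, one float, and one string column
df = pd.DataFrame({
    "Age": [25, 32, 47],
    "Salary": [50000.0, 64000.5, 81000.0],
    "Country": ["France", "India", "Brazil"],
})

# Keep only the numeric columns (integers and floats)
only_numeric = df.select_dtypes(include=["number"])

# Keep everything that is not numeric
non_numeric = df.select_dtypes(exclude=["number"])

print(list(only_numeric.columns))
print(list(non_numeric.columns))
```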

#### Categorical Variables
* Categorical variables are used to represent groups that are qualitative in nature, like colors, country of birth
* You will need to encode categorical values as numeric values to use them in your machine learning models 
* When categories are unordered (like colors, country of birth), assigning ordered numerical values to them may greatly reduce the effectiveness of your model.
* Thus, you cannot allocate arbitrary numbers to each category, as that would imply some form of ordering to the categories
* $\Rightarrow$ **One Hot Encoding**
* $\Rightarrow$ **Dummy Encoding**
    * Very similar, and often confused
    * by default, pandas performs one hot encoding when you use the get_dummies() function
    * difference:
        * **One Hot Encoding:** converts *n* categories into *n* features
            * `pd.get_dummies(df, columns=['Country'], prefix ='C')`
            * note that specifying a prefix argument can improve readability, especially if the list of column names passed to `columns` contains more than one column.
            * **Use for: generally creating more explainable features**
            * **Note:** one-hot encoding may create features that are entirely collinear, because the same information is represented multiple times.
        * **Dummy Encoding:** creates *n* - 1 features for *n* categories
            * `pd.get_dummies(df, columns=['Country'], drop_first=True, prefix='C')`
            * the dropped column (referred to as the *base column*) is encoded by the absence of all other features, and its value is represented by the intercept
            * **Use for: Necessary information without duplication.**
            
        * Both one-hot encoding and dummy encoding may result in a **huge** number of columns being created if there are too many different categories in a column 
        * In these cases, you may only want to create columns for the most common values:
            * `counts = df['Country'].value_counts()` # to check occurrences of each category value
            * once you have the occurrence counts for each category, you can use them to limit which values you include by first creating a mask of values that occur fewer than *n* times (here, *n* = 5):
            * `mask = df['Country'].isin(counts[counts<5].index)`
            * use the mask to replace these less frequent categories with a value of your choice (for example, an umbrella category like 'Other')
            * `df.loc[mask, 'Country'] = 'Other'` (using `.loc` avoids the chained assignment `df['Country'][mask] = ...`, which may not modify the original DataFrame)
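
The following sketch puts these steps together on a toy `Country` column: rare categories are grouped under 'Other', then the column is encoded both ways. The data and the threshold of 2 are assumptions chosen to keep the example small.

```python
import pandas as pd

# Hypothetical data: 'Brazil' and 'Peru' each appear only once
df = pd.DataFrame(
    {"Country": ["France", "India", "India", "Brazil", "India", "France", "Peru"]}
)

# Group categories that occur fewer than 2 times under 'Other'
counts = df["Country"].value_counts()
mask = df["Country"].isin(counts[counts < 2].index)
df.loc[mask, "Country"] = "Other"

# One-hot encoding: n categories -> n columns
one_hot = pd.get_dummies(df, columns=["Country"], prefix="C")

# Dummy encoding: n categories -> n - 1 columns (first category is the base)
dummy = pd.get_dummies(df, columns=["Country"], prefix="C", drop_first=True)

print(list(one_hot.columns))
print(list(dummy.columns))
```

With three remaining categories, one-hot encoding yields three columns while dummy encoding yields two, with `France` acting as the base column.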

#### Numeric variables
* Even if your raw data is all numeric, there is still a lot you can do to improve your features 
* Types of numeric features:
    * Age
    * Price
    * Counts
    * Geospatial data (such as coordinates) 
* A few of the considerations and possible feature engineering steps to keep in mind when dealing with numeric data:

* **Is the magnitude of the feature its most important trait, or just its direction?**
    * Can you turn numeric values (for example, number of restaurant health code violations) into binary values (has restaurant ever violated a health code before? yes/no)
    * **Binarizing numeric data:**
    * #Create new column: Binary Violation
    * `df['Binary_Violation'] = 0`
    * `df.loc[df['Number_of_Violations'] > 0, 'Binary_Violation'] = 1`
    
    * **Binning numeric variables:** 
    * Similar to binarizing, but using more than just 2 bins
    * Often useful for variables such as age brackets, income brackets, etc where exact numbers are less relevant than general magnitude of the value 
    * `df['Binned_Group'] = pd.cut(df['Number_of_Violations'], bins=[-np.inf, 0, 2, np.inf], labels =[1,2,3])`
    * note in the code above: the `bins` argument lists the cut-off points, so 3 bins require 4 values
    * bins created using `pd.cut()`
    * **Note:** A new column can be created using `df[column_name] = default_value`

```
# Create the Paid_Job column filled with zeros
so_survey_df['Paid_Job'] = 0

# Replace all the Paid_Job values where ConvertedSalary is > 0
so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1

# Print the first five rows of the columns
print(so_survey_df[['Paid_Job', 'ConvertedSalary']].head())
```
    
* Bins are created using pd.cut(df['column_name'], bins) where bins can be an integer specifying the number of evenly spaced bins, or a list of bin boundaries.

```
# Import numpy
import numpy as np

# Specify the boundaries of the bins
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]

# Bin labels
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# Bin the continuous variable ConvertedSalary using these boundaries
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'], 
                                         bins=bins, labels=labels)

# Print the first 5 rows of the boundary_binned column
print(so_survey_df[['boundary_binned', 'ConvertedSalary']].head())
```
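
For comparison, here is a sketch of the other form of `pd.cut()`, where `bins` is an integer and pandas derives evenly spaced boundaries from the data range. The salary values and the label names are made up for illustration.

```python
import pandas as pd

# Hypothetical salaries spanning 12,000 to 200,000
salaries = pd.Series([12000, 48000, 75000, 130000, 200000])

# Four equal-width bins computed from the min and max of the data
equal_binned = pd.cut(salaries, bins=4)

# Same bins, but with readable labels instead of interval objects
labeled = pd.cut(salaries, bins=4, labels=["band_1", "band_2", "band_3", "band_4"])

print(equal_binned.cat.categories)
print(list(labeled))
```

Equal-width bins are quick to set up but can leave some bins nearly empty when the data is skewed, which is one reason to specify boundaries by hand as in the block above.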

## Text mining in Python
* Text Mining is the process of deriving meaningful information from natural language text.
* The overall goal is to turn the texts into data for analysis, via application of Natural Language Processing.

## Natural Language Processing (NLP)
* NLP is a part of computer science and artificial intelligence which deals with human languages
* In other words, NLP is a component of text mining that performs a special kind of linguistic analysis that essentially helps a machine "read" text.
* It uses a range of methodologies to decipher the ambiguities in human language, including:
    * automatic summarization
    * part-of-speech tagging
    * disambiguation
    * chunking
    * natural language understanding and recognition

**First, we need to install NLTK (the Natural Language Toolkit), a library for building Python programs that work with human language data, for example with `pip install nltk`.**

### Terminology

* **Tokenization:**
    * the first step in NLP
    * it is the process of breaking strings into tokens, which are small structures or units
    * involves three steps:
        * 1) breaking a complex sentence into words
        * 2) understanding the importance of each word with respect to the sentence
        * 3) producing a structural description of the input sentence
* Example input:

```
# Importing necessary library
import pandas as pd
import numpy as np
import nltk
import os
import nltk.corpus
# Sample text for performing tokenization
text = ("In Brazil they drive on the right-hand side of the road. Brazil has a large "
        "coastline on the eastern side of South America")
# Importing word_tokenize from nltk
# (needs NLTK's tokenizer data: the 'punkt' resource, or 'punkt_tab' in newer releases)
from nltk.tokenize import word_tokenize
# Passing the string into word_tokenize to break it into tokens
token = word_tokenize(text)
token
```
* Output:

```
['In','Brazil','they','drive', 'on','the', 'right-hand', 'side', 'of', 'the', 'road', '.', 'Brazil', 'has', 'a', 'large', 'coastline', 'on', 'the', 'eastern', 'side', 'of', 'South', 'America']
```

#### Finding frequency of distinct tokens in the text
* **2** methods:

* Example input, **method 1**:

```
# Find the frequency of each distinct token
# Import FreqDist from nltk and pass the tokens to it
from nltk.probability import FreqDist
fdist = FreqDist(token)
fdist
```
* Output: `FreqDist({'the': 3, 'Brazil': 2, 'on': 2, 'side': 2, 'of': 2, 'In': 1, 'they': 1, 'drive': 1, 'right-hand': 1, 'road': 1, ...})`

* Example input, **method 2**:

```
# To find the frequency of top 10 words
fdist1 = fdist.most_common(10)
fdist1
```
* Output:

```
[('the', 3),
 ('Brazil', 2),
 ('on', 2),
 ('side', 2),
 ('of', 2),
 ('In', 1),
 ('they', 1),
 ('drive', 1),
 ('right-hand', 1),
 ('road', 1)]
 ```
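
If NLTK or its tokenizer data is not available, a rough approximation of the same frequency count can be sketched with the standard library alone. The regular expression below is a simplifying assumption and is not equivalent to `word_tokenize` in general; it just happens to split this sample sentence in a similar way.

```python
import re
from collections import Counter

text = ("In Brazil they drive on the right-hand side of the road. Brazil has a large "
        "coastline on the eastern side of South America")

# Simplified tokenizer: runs of word characters/hyphens, or single punctuation marks
tokens = re.findall(r"[\w-]+|[^\w\s]", text)

# Counter plays the role of FreqDist here
freq = Counter(tokens)

print(len(tokens))
print(freq.most_common(3))
```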
 

#### Stemming
* Stemming usually refers to normalizing words into their base or root form.
* For example: **waiting**, **waited**, **waits** $\Rightarrow$ **wait**
* Two stemmers are covered here: **1) Porter Stemming** (removes common morphological and inflectional endings from words) and **2) Lancaster Stemming** (a more aggressive stemming algorithm).

* **method 1:**

```
# Importing PorterStemmer from the nltk library
# Checking the word 'waiting'
from nltk.stem import PorterStemmer
pst = PorterStemmer()
pst.stem("waiting")
```
* output: `wait`

* **method 2:**

```
# Checking for the list of words
stm = ["waited", "waiting", "waits"]
for word in stm :
   print(word+ ":" +pst.stem(word))
```
* output:

```
waited:wait
waiting:wait
waits:wait
```
* **method 3:**

```
# Importing LancasterStemmer from nltk
from nltk.stem import LancasterStemmer
lst = LancasterStemmer()
stm = ["giving", "given", "given", "gave"]
for word in stm:
    print(word + ":" + lst.stem(word))
``` 
* output:

```
giving:giv
given:giv
given:giv
gave:gav
```

#### Lemmatization
* Lemmatization groups together the different inflected forms of a word under a common base form, called the *lemma*.
* It is somewhat similar to stemming, as it maps several words onto one common root.
* The output of lemmatization is a proper word.
* For example, a lemmatizer should map 'gone', 'going', and 'went' to 'go'.

* **Lemmatization**, in simpler terms, is the process of converting a word to its base form.
* The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming applies suffix-stripping rules, which can produce stems that are not real words or that change the meaning.
* For example, lemmatization would correctly reduce 'caring' to 'care', whereas a crude stemmer that simply cuts off the 'ing' would produce 'car'.
* Lemmatization can be implemented in Python using the WordNet Lemmatizer (NLTK), the spaCy lemmatizer, TextBlob, or Stanford CoreNLP.

* **method:**

```
# Importing the WordNet lemmatizer from nltk
# (requires the WordNet corpus, e.g. via nltk.download('wordnet'))
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
```
* output:

```
rocks : rock
corpora : corpus
```

#### Stop words
* “Stop words” are the most common words in a language, like “the”, “a”, “at”, “for”, “above”, “on”, “is”, “all”. These words carry little meaning on their own and are usually removed from texts. We can remove them using the nltk library.

* **method:**

```
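# Importing stopwords from the nltk library
from nltk import word_tokenize
from nltk.corpus import stopwords
# English stop words as a set for fast membership checks
stop_words = set(stopwords.words('english'))
text = "Cristiano Ronaldo was born on February 5, 1985, in Funchal, Madeira, Portugal."
text1 = word_tokenize(text.lower())
print(text1)
# Keep only the tokens that are not stop words
# (named `filtered` so the imported `stopwords` module is not overwritten)
filtered = [x for x in text1 if x not in stop_words]
print(filtered)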
```
* output:

```
['cristiano', 'ronaldo', 'was', 'born', 'on', 'february', '5', ',', '1985', ',', 'in', 'funchal', ',', 'madeira', ',', 'portugal', '.']
Output of stopwords:
['cristiano', 'ronaldo', 'born', 'february', '5', ',', '1985', ',', 'funchal', ',', 'madeira', ',', 'portugal', '.']
```

#### Part of speech tagging (POS)
* Part-of-speech tagging assigns a part of speech to each word of a given text (such as noun, verb, pronoun, adverb, conjunction, adjective, interjection)
* Many POS taggers are available; widely used ones include NLTK, spaCy, TextBlob, and Stanford CoreNLP.
* **method:**

```
text = "vote to choose a particular man or a group (party) to represent them in parliament"
# Tokenize the text
tex = word_tokenize(text)
# Tag each token on its own; tagging in isolation ignores the surrounding context
for token in tex:
    print(nltk.pos_tag([token]))
```
* output:

```
[('vote', 'NN')]
[('to', 'TO')]
[('choose', 'NN')]
[('a', 'DT')]
[('particular', 'JJ')]
[('man', 'NN')]
[('or', 'CC')]
[('a', 'DT')]
[('group', 'NN')]
[('(', '(')]
[('party', 'NN')]
[(')', ')')]
[('to', 'TO')]
[('represent', 'NN')]
[('them', 'PRP')]
[('in', 'IN')]
[('parliament', 'NN')]
```

#### Named Entity Recognition
* Named entity recognition is the process of detecting named entities such as person names, location names, company names, quantities, and monetary values.
* **method:**

```
text = "Google's CEO Sundar Pichai introduced the new Pixel at Minnesota Roi Centre Event"
# Importing the ne_chunk function from nltk
from nltk import ne_chunk
# tokenize and POS Tagging before doing chunk
token = word_tokenize(text)
tags = nltk.pos_tag(token)
chunk = ne_chunk(tags)
chunk
```
* output:

```
Tree('S', [Tree('GPE', [('Google', 'NNP')]), ("'s", 'POS'), Tree('ORGANIZATION', [('CEO', 'NNP'), ('Sundar', 'NNP'), ('Pichai', 'NNP')]), ('introduced', 'VBD'), ('the', 'DT'), ('new', 'JJ'), ('Pixel', 'NNP'), ('at', 'IN'), Tree('ORGANIZATION', [('Minnesota', 'NNP'), ('Roi', 'NNP'), ('Centre', 'NNP')]), ('Event', 'NNP')])
```

#### Chunking
* Chunking means picking up individual pieces of information and grouping them into bigger pieces. In the context of NLP and text mining, chunking means a grouping of words or tokens into chunks.
* **method:**

```
text = "We saw the yellow dog"
token = word_tokenize(text)
tags = nltk.pos_tag(token)
# Noun phrase: optional determiner, any number of adjectives, then a noun
reg = "NP: {<DT>?<JJ>*<NN>}"
a = nltk.RegexpParser(reg)
result = a.parse(tags)
print(result)
```
* output:
`(S We/PRP saw/VBD (NP the/DT yellow/JJ dog/NN))`


#### Encoding Text
* When you are faced with text data, it will often **not** be tabular
* Data that is not in a predefined form is called **unstructured data**
    * One example of unstructured data is: **free text**
* Before you can leverage text data in an ML model, you must first transform it into a series of columns of numbers or vectors 
* There are many different approaches to achieving this; we'll go through some of the most common
* Using Inaugural Speeches dataset:
    * before any text analytics can be performed, you must ensure that the text data is in a format that can be used
    * the body of the text of these speeches contained in 'text' column as single observation per speech
    
* **1)** Most bodies of text will have **non-letter characters**, such as punctuation, that **will need to be removed before analysis.**
    * Use: `.str.replace()`
    * regex:
        * `[a-zA-Z]` : All letter characters
        * `[^a-zA-Z]` : All non-letter characters
    * `speech_df['text'] = speech_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)` (since pandas 2.0 the pattern is treated as a literal string unless `regex=True` is passed)
 
* **2)** Once you have removed all unwanted characters, you will then want to **standardize the remaining characters in your text so that they are all lowercase.**
    * `speech_df['text'] = speech_df['text'].str.lower()`
        
* Often there is value in the fundamental characteristics of a text
    * **Length of a text**
        * `speech_df['char_cnt'] = speech_df['text'].str.len()`
        * calculate the number of characters in each speech
    * **Word counts**
        * `speech_df['word_cnt'] = speech_df['text'].str.split().str.len()`
    * **Average length of word**
        * `speech_df['avg_word_len'] = speech_df['char_cnt'] / speech_df['word_cnt']`
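
The cleaning and basic-statistics steps above can be sketched end to end as follows. The two short strings are invented stand-ins for the speech texts, and the cleaned text is written to a separate `text_clean` column (an assumption here, so the raw text is preserved).

```python
import pandas as pd

# Hypothetical stand-in for the speeches dataset
speech_df = pd.DataFrame({"text": ["Fellow-Citizens of the Senate!", "We, the People."]})

# 1) Replace non-letter characters with spaces, 2) lowercase everything
speech_df["text_clean"] = (
    speech_df["text"]
    .str.replace("[^a-zA-Z]", " ", regex=True)
    .str.lower()
)

# Basic characteristics of each text
speech_df["char_cnt"] = speech_df["text_clean"].str.len()
speech_df["word_cnt"] = speech_df["text_clean"].str.split().str.len()
# Note: char_cnt includes spaces, so this slightly overstates the true average
speech_df["avg_word_len"] = speech_df["char_cnt"] / speech_df["word_cnt"]

print(speech_df[["text_clean", "char_cnt", "word_cnt", "avg_word_len"]])
```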

#### Word Count Representation
* The most common approach to creating features from transformed & cleaned text data, is to create a column for each word and record the number of times each particular word appears in each text.
* this results in a set of columns equal in width to the number of unique words in the dataset with counts filling each entry.
* sklearn already has a built-in function for this with: `CountVectorizer()`

```
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
```
* Creating a column for every word will often result in far too many features for analysis. Thankfully, you can specify arguments when initializing your CountVectorizer to limit this.
* For example, you can specify the minimum number of texts that a word must appear in with `min_df`
    * if a float is given (for example `min_df = 0.1`), the word must appear in at least this proportion of documents (here, 10%); if an integer is given, it is an absolute document count
    * this threshold eliminates words that occur so rarely that they would not be useful when generalizing to new texts
    * conversely, `max_df` keeps only words that occur in at most a certain proportion of the documents
    * this can be useful to remove words that occur too frequently to be of any value 

```
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(min_df = 0.1, max_df = 0.9)
cv.fit(speech_df['text_clean'])
cv_transformed = cv.transform(speech_df['text_clean'])
```
* The above code outputs a *sparse array* with a row for each text and a column for every word
* to transform this to a *non-sparse array*:
`cv_transformed.toarray()`
* the output will be an array with no concept of column names
* to get the names of the features (words) that have been generated, call:
`feature_names = cv.get_feature_names_out()`
* this returns an array of the features (words) generated, in the same order as the columns of the converted array (older scikit-learn releases used `get_feature_names()`, which was removed in version 1.2)

* **Note:** While fitting and transforming separately can be useful, particularly when you need to transform a different dataset than the one that you used to fit the vectorizer, you can accomplish both steps at once using the `.fit_transform()` method
* **Putting it all together:**

`cv_df = pd.DataFrame(cv_transformed.toarray(), columns=cv.get_feature_names_out()).add_prefix('Counts_')`

* you can now combine this dataframe with your original dataframe so that they can be used to generate future analytical models 

`speech_df = pd.concat([speech_df, cv_df], axis=1, sort=False)`


#### Tf-Idf (Term Frequency- Inverse Document Frequency)
* While counts of occurrences of words can be a good first step towards encoding your text to build models, they have some limitations
* The main issue is that counts will be much higher for very common words that occur across all texts, providing very little value as a distinguishing feature.
* To keep common words like "the" from overpowering your model, some form of normalization can be used 
* One of the most effective approaches is called **Term Frequency-Inverse Document Frequency**, or **TF-IDF**

\begin{equation}
\text{TF-IDF} =
\frac{\text{count of word occurrences}}{\text{total words in document}} \times \log\left(\frac{\text{total number of docs}}{\text{number of docs word is in}}\right)
\end{equation}
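
As a worked illustration, the sketch below computes TF-IDF by hand as term frequency multiplied by the log of (total documents / documents containing the word). The three tiny documents are made up, and this is the plain textbook form; scikit-learn's `TfidfVectorizer` applies a smoothed variant with normalization by default, so its numbers will differ.

```python
import math

# Hypothetical corpus of three tokenized documents
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the bird flew".split(),
]

def tf_idf(word, doc, docs):
    """Textbook TF-IDF of `word` in `doc` relative to the corpus `docs`."""
    tf = doc.count(word) / len(doc)                    # term frequency
    n_containing = sum(1 for d in docs if word in d)   # document frequency
    idf = math.log(len(docs) / n_containing)           # inverse document frequency
    return tf * idf

print(tf_idf("the", docs[0], docs))  # 0.0, since "the" is in every document
print(tf_idf("sat", docs[0], docs))
print(tf_idf("cat", docs[0], docs))
```

A word that appears in every document gets a weight of zero, while a word unique to one document ("cat") outweighs one shared by two documents ("sat").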

* the effect: reduces the value of common words, while increasing the weights of words that do not occur in many documents 
* to use the TF-IDF Vectorizer

```
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer()
```
* Similarly to the CountVectorizer, you can limit the number of features created by specifying arguments when initializing the TfidfVectorizer
* Setting the `max_features` argument to 100 keeps only the 100 most common words
* `stop_words` specifies a list of very common words to exclude, such as "and" and "the" in English
    * you can use sklearn's built-in list, load your own, or use lists provided by other Python libraries

```
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(max_features=100, stop_words='english')
tv.fit(train_speech_df['text'])
train_tv_transformed = tv.transform(train_speech_df['text'])
```
* **Note** that here we are fitting and transforming the training data as a subset of the original data

```
train_tv_df = pd.DataFrame(train_tv_transformed.toarray(), columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
train_speech_df = pd.concat([train_speech_df, train_tv_df], axis = 1, sort = False)
```
* Inspect your transformation:

```
examine_row = train_tv_df.iloc[0]
print(examine_row.sort_values(ascending=False))
```
* **Applying a vectorizer to new data (like the test set):**
* **Preprocess** your test data using the transformations made on the train data **only**.

```
# Transform the test data with the vectorizer fitted on the training data
test_tv_transformed = tv.transform(test_speech_df['text'])

test_tv_df = pd.DataFrame(test_tv_transformed.toarray(), columns=tv.get_feature_names_out()).add_prefix('TFIDF_')

test_speech_df = pd.concat([test_speech_df, test_tv_df], axis = 1, sort=False)
```

#### Bag of words and N-grams
* So far we've looked at individual words on their own, without any context
* This approach is called a **bag of words model**, as the words are being treated as if they are being drawn from a bag at random, with no concept of order or grammar
* Individual words *can* lose all their context or meaning when viewed independently 
* Issues with bag of words:
    * Positive meaning: happy
    * Negative meaning: not happy
    * Positive meaning: never not happy
* $\Uparrow$ different meanings of "happy" depending on context 
* One common method to retain at least some concept of word order in a text is to instead use multiple consecutive words, like pairs (**bi-grams**), or three consecutive words (**tri-grams**).
* This maintains at least some ordering information, while at the same time allowing for the creation of a reasonable set of features
* To leverage N-grams in your own models:
* the `ngram_range` parameter takes a tuple giving the minimum and maximum length of the n-grams to include, e.g. `(2, 2)` for bi-grams only

```
tv_bi_gram_vec = TfidfVectorizer(ngram_range =(2,2))

tv_bi_gram = tv_bi_gram_vec.fit_transform(speech_df['text'])
```
* Create a DataFrame with the Counts features 

```
tv_df = pd.DataFrame(tv_bi_gram.toarray(), columns=tv_bi_gram_vec.get_feature_names_out()).add_prefix('Counts_')
tv_sums = tv_df.sum()
```
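
To make the idea of n-grams concrete, here is a minimal pure-Python sketch that builds them from a token list. The helper name `ngrams` is hypothetical (it is not the scikit-learn implementation), and the phrase is the "happy" example from above.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "never not happy".split()

print(ngrams(tokens, 1))  # ['never', 'not', 'happy']
print(ngrams(tokens, 2))  # ['never not', 'not happy']
print(ngrams(tokens, 3))  # ['never not happy']
```

The bi-gram "not happy" keeps the negation attached to the word it modifies, which a bag of single words cannot do.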


#### Image Processing and scikit-image

