
# ***Fundamentals of Machine Learning***

---



# Scikit-learn
In 2007, David Cournapeau developed Scikit-learn as a Google Summer of Code
project. INRIA got involved in 2010, and beta v0.1 was released to the
public. Currently, there are more than 700 active contributors, with paid sponsorship from
INRIA, the Python Software Foundation, Google, and Tinyclues. Many of Scikit-learn's functions
are built upon the SciPy (Scientific Python) library, and it provides a broad range of
efficiently implemented, essential supervised and unsupervised learning algorithms.

# Machine Learning Perspective of Data
Data is the facts and figures (also referred to as raw data) that we have available
with respect to the business context. Data is made up of two aspects:
1. Objects, such as people, trees, animals, etc.
2. Attributes that were recorded for the objects, such as age, size, weight,
cost, etc.

At a high level, there are two types of variables, based on the type of values they
can take:
1. Continuous or quantitative: Variables can take any positive or
negative numerical value within a large range. Retail sales amount
and insurance claim amount are examples of continuous
variables. These types of variables are also generally called
numerical variables.
2. Discrete or qualitative: Variables can take only particular values.
Retail store location area, state, and city are examples of
discrete variables, as each can take only one particular value for a
store (here “store” is our object). These types of variables are also
known as categorical variables.
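
As a rough illustration of the two variable types, the sketch below separates numerical from categorical columns by dtype. The `stores` table and its values are hypothetical, invented only for this example.

```python
import pandas as pd

# Hypothetical store data, invented for illustration only.
stores = pd.DataFrame({
    'city': ['Austin', 'Boston', 'Chicago'],
    'state': ['TX', 'MA', 'IL'],
    'sales_amount': [1520.50, 980.25, 2210.00],
})

# Columns with a numeric dtype are treated as numerical variables.
numerical_cols = stores.select_dtypes(include='number').columns.tolist()
# The remaining columns hold text here, so they are treated as categorical.
categorical_cols = stores.select_dtypes(exclude='number').columns.tolist()

print('Numerical:  ', numerical_cols)
print('Categorical:', categorical_cols)
```

Note that dtype is only a heuristic: a zip code stored as an integer is still a categorical variable.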


### Scales of Measurement
In general, variables can be measured on four different scales:

1. Nominal
2. Ordinal
3. Interval
4. Ratio
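
The difference between the nominal and ordinal scales can be sketched with pandas categoricals. This is only an illustration: the sample values and the chosen order (low < medium < high) are assumptions, and interval and ratio variables would simply be stored as ordinary numbers.

```python
import pandas as pd

# Sample levels, invented for illustration only.
levels = pd.Series(['high', 'medium', 'low', 'medium'])

# Nominal: the categories are labels with no inherent order.
nominal = levels.astype('category')

# Ordinal: the order is declared explicitly (low < medium < high).
ordinal = pd.Categorical(levels,
                         categories=['low', 'medium', 'high'],
                         ordered=True)

print(nominal.cat.ordered)           # False
print(list(ordinal.codes))           # [2, 1, 0, 1]
print(ordinal.min(), ordinal.max())  # low high
```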

Mean, median, and mode are measures of central tendency (the middle point)
of a data distribution. Standard deviation, variance, and range are the
most commonly used measures of dispersion, which describe the spread of the data.
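
A minimal sketch of these measures using pandas follows. The sample values are hypothetical, and note that pandas computes the sample variance and standard deviation (dividing by n-1) by default.

```python
import pandas as pd

# Hypothetical sample values, invented for illustration only.
values = pd.Series([10, 20, 20, 30, 70])

# Central tendency
mean_value = values.mean()             # 30.0
median_value = values.median()         # 20.0
mode_values = values.mode().tolist()   # [20]

# Dispersion
value_range = values.max() - values.min()  # 60
variance = values.var()                    # 550.0 (sample variance, ddof=1)
std_dev = values.std()                     # square root of the variance

print('mean:', mean_value, 'median:', median_value, 'mode:', mode_values)
print('range:', value_range, 'variance:', variance, 'std dev:', std_dev)
```

The single large value (70) pulls the mean above the median, which is why it helps to look at more than one measure.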

### Handling categorical data

Most ML libraries are designed to work with numerical variables, so categorical variables in their original text form can’t be used directly for model building. Let’s look at some common methods of handling categorical data, depending on the number of levels.


In [1]:
import random
random.seed(2017)
import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({'A': ['high', 'medium', 'low'],
                   'B': [10, 20, 30]},
                  index=[0, 1, 2])

print(df)

        A   B
0    high  10
1  medium  20
2     low  30


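### Create dummy variables

A dummy variable is a Boolean variable that indicates the presence of a category with the value 1 and its absence with 0. You should create k-1 dummy variables, where k is the number of levels, because the last level is implied when all the other dummies are 0. Scikit-learn provides the `OneHotEncoder` class to create dummy variables for a given categorical variable. The cell below uses the pandas `get_dummies` function instead, which creates all k columns unless `drop_first=True` is passed.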


In [2]:
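# dtype=int keeps the 1/0 indicators shown below (pandas 2.0 and later default to True/False)
df_with_dummies = pd.get_dummies(df, prefix='A', columns=['A'], dtype=int)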

print(df_with_dummies)

    B  A_high  A_low  A_medium
0  10       1      0         0
1  20       0      0         1
2  30       0      1         0
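
For comparison, here is a minimal sketch of the k-1 encoding with Scikit-learn's `OneHotEncoder`, assuming a version that supports the `drop` parameter. The data frame repeats the one used in this section, and the column names prefixed with `A_` are built by hand for readability.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'A': ['high', 'medium', 'low'],
                   'B': [10, 20, 30]})

# drop='first' removes the first category in sorted order ('high'),
# leaving k-1 = 2 dummy columns for the k = 3 levels.
encoder = OneHotEncoder(drop='first')
encoded = encoder.fit_transform(df[['A']]).toarray()

# categories_ holds the sorted levels; skip the dropped first one.
kept_levels = list(encoder.categories_[0][1:])
dummies = pd.DataFrame(encoded.astype(int),
                       columns=['A_' + level for level in kept_levels],
                       index=df.index)

print(pd.concat([df[['B']], dummies], axis=1))
```

A row with 0 in both `A_low` and `A_medium` corresponds to the dropped level, `high`.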


### Convert categories to numeric labels

Another simple method is to represent the text description of each level with a number by using the `LabelEncoder` class of Scikit-learn. If the number of levels is high (for example zip code or state), you can apply business logic to combine levels into groups. For example, zip codes or states can be combined into regions; however, this method risks losing critical information. Another method is to combine categories based on similar frequency (the new categories can be high, medium, and low).

In [3]:
import pandas as pd

# Using pandas' factorize function: codes follow the order of first appearance
df['A_pd_factorized'] = pd.factorize(df['A'])[0]

# Alternatively, use scikit-learn's LabelEncoder class: codes follow the sorted (alphabetical) order
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

df['A_LabelEncoded'] = le.fit_transform(df['A'])
print(df)

        A   B  A_pd_factorized  A_LabelEncoded
0    high  10                0               0
1  medium  20                1               2
2     low  30                2               1
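
The two ways of combining levels described above can be sketched as follows. The state values, the state-to-region mapping, and the frequency thresholds are all assumptions made for illustration; in practice they would come from the business context.

```python
import pandas as pd

# Hypothetical state column, invented for illustration only.
states = pd.Series(['CA', 'CA', 'CA', 'CA', 'TX', 'TX', 'TX',
                    'NY', 'NY', 'VT'])

# 1. Business-logic grouping: map each state to a region.
#    This mapping is an assumption made for the example.
region_map = {'CA': 'West', 'TX': 'South',
              'NY': 'Northeast', 'VT': 'Northeast'}
regions = states.map(region_map)

# 2. Frequency-based grouping: replace each state with a band
#    based on how often it occurs. The thresholds are arbitrary.
counts = states.value_counts()

def frequency_band(count):
    if count >= 4:
        return 'high'
    if count >= 2:
        return 'medium'
    return 'low'

bands = states.map(counts).map(frequency_band)

print(pd.DataFrame({'state': states, 'region': regions, 'band': bands}))
```

Both groupings reduce the number of levels, at the cost of no longer distinguishing the states within a group.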
