# Descriptive Analytics and Data Preprocessing

We will import the DescriptiveAnalysis and DataPreprocessing classes from CommonFunctions and pass a pandas DataFrame to each when we create the objects. This notebook explores what can be done with the methods these objects expose.
If you would like to use them in your own work, copy CommonFunctions.py and import the required classes.

In [1]:
# Importing descriptive analysis
import seaborn as sn
import sys

sys.path.append('../')
from CommonFuncs.CommonFunctions import DescriptiveAnalysis,DataPreprocessing

In [2]:
import seaborn as sn
data = sn.load_dataset("iris")
da = DescriptiveAnalysis(data)

# Descriptive statistics for a continuous variable
x = da.descriptive_stats_cont('sepal_length')

Mean 5.843333333333335
Median 5.8
Mode 0    5.0
dtype: float64
Skew 0.3149109566369728
Kurtosis -0.5520640413156395
Standard Deviation 0.8280661279778629
Min 4.3
Max 7.9
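
For readers without CommonFunctions.py, the same summary can be approximated with plain pandas. This is a minimal sketch on a small hand-made Series (not the iris data); it assumes, without confirmation from the source, that descriptive_stats_cont wraps the corresponding pandas methods.

```python
import pandas as pd

# Small hand-made sample so the sketch runs offline (not the iris data)
s = pd.Series([4.9, 5.0, 5.0, 5.4, 5.8, 6.3, 7.1], name="sepal_length")

summary = {
    "Mean": s.mean(),
    "Median": s.median(),
    "Mode": s.mode().tolist(),       # mode() returns a Series because ties are possible
    "Skew": s.skew(),
    "Kurtosis": s.kurt(),            # excess kurtosis (a normal distribution gives 0)
    "Standard Deviation": s.std(),   # sample standard deviation (ddof=1)
    "Min": s.min(),
    "Max": s.max(),
}
for name, value in summary.items():
    print(name, value)
```

The mode comes back as a Series in the output above for the same reason: a column can have more than one most frequent value.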


In [3]:
# Descriptive Statistics for a Categorical Variable
x = da.descriptive_stats_categ('species')

species
setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64


In [4]:
# For checking how many records have a particular value
x = da.check_val('sepal_length',5.1)

Count is 9


In [5]:
# You can check for a list of values as well
x = da.check_val('sepal_length',[5.1,5.0,5.2])

Count is 23
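
A rough pandas equivalent of this value check is shown below. The helper name count_values is hypothetical and the data is a toy frame; how check_val is actually implemented is not shown in this notebook.

```python
import pandas as pd

df = pd.DataFrame({"sepal_length": [5.1, 4.9, 5.0, 5.1, 5.2, 6.0]})

def count_values(frame, column, values):
    """Count rows whose `column` equals a value, or any value in a list."""
    if not isinstance(values, (list, tuple, set)):
        values = [values]            # wrap a scalar so isin() accepts it
    return int(frame[column].isin(values).sum())

single = count_values(df, "sepal_length", 5.1)
several = count_values(df, "sepal_length", [5.1, 5.0, 5.2])
print("Count is", single)
print("Count is", several)
```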


Let's introduce some missing values and find them using the DescriptiveAnalysis class.

In [6]:
import numpy as np
nanidx = data.sample(frac=0.2).index
data.loc[nanidx,'sepal_length'] = np.nan

# Note that you have to recreate the object because you have modified the data
da = DescriptiveAnalysis(data)

In [7]:
# Now let's check the missing values
x = da.check_missing('sepal_length')

Total missing 30
Total complete cases 120
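
The two counts can be reproduced with isna() and notna(). This is a sketch on toy data, offered as an illustration rather than as the code behind check_missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sepal_length": [5.1, np.nan, 4.7, np.nan, 5.0]})

total_missing = int(df["sepal_length"].isna().sum())    # rows where the value is NaN
total_complete = int(df["sepal_length"].notna().sum())  # rows with an observed value

print("Total missing", total_missing)
print("Total complete cases", total_complete)
```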


##  Data Preprocessing

We will now look at several common data preprocessing tasks and see how the DataPreprocessing class helps with them.

In [8]:
# First we will create a new object with the DataPreprocessing class
dp = DataPreprocessing(data)

### Dealing with missing values
We introduced missing values in the previous step. We can use several strategies to deal with them.
First we will see how to remove the observations with missing values.



In [9]:
# Let's look at the data, which now has missing values
data.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,,3.4,1.4,0.3,setosa
7,,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


In [10]:
newdf = dp.remove_missing("sepal_length")
newdf.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa
10,5.4,3.7,1.5,0.2,setosa
11,4.8,3.4,1.6,0.2,setosa
12,4.8,3.0,1.4,0.1,setosa


As we can see, the rows with missing values in sepal_length have been removed. We could also call the function without a column name, in which case it removes rows with a missing value in any column.

In [11]:
# Let's re-introduce the missing values

data = sn.load_dataset("iris")
nanidx = data.sample(frac=0.2).index
data.loc[nanidx,'sepal_length'] = np.nan
nanidx = data.sample(frac=0.2).index
data.loc[nanidx,'petal_length'] = np.nan
data.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


In [12]:
newdf = dp.remove_missing()
newdf.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa
10,5.4,3.7,1.5,0.2,setosa
11,4.8,3.4,1.6,0.2,setosa
12,4.8,3.0,1.4,0.1,setosa


As we can see, all the observations that are retained are complete cases.
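
Both removal modes map naturally onto pandas dropna. The sketch below uses a toy frame and is an assumption about what remove_missing does, not its actual code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [np.nan, 4.9, 4.7, 4.6],
    "petal_length": [1.4, 1.4, np.nan, 1.5],
    "species": ["setosa"] * 4,
})

# Drop rows that are missing a value in one named column only
one_column = df.dropna(subset=["sepal_length"])

# Drop rows that are missing a value in any column (complete cases only)
all_columns = df.dropna()

print(one_column.index.tolist())
print(all_columns.index.tolist())
```

Note that dropna keeps the original index labels, which is why the outputs above start at row 1 rather than 0.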

In [13]:
# Missing values are not always stored as NaN; sometimes they appear as blanks, 0, NULL, etc.
# Let's introduce some such placeholders and then replace them with proper NaNs
data = sn.load_dataset("iris")
nanidx = data.sample(frac=0.2).index
data.loc[nanidx,'species'] = 'NULL'


In [14]:
# We will replace the NULL values with NaNs
dp = DataPreprocessing(data)
newdf = dp.mark_missing("species",'NULL')
newdf.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,
9,4.9,3.1,1.5,0.1,setosa


We can now adopt a single strategy for dealing with NaN instead of handling NULL, blanks, and other placeholders separately.
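
In plain pandas, placeholder values can be converted to NaN with replace. This is a sketch on toy data; the list of placeholders is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"species": ["setosa", "NULL", "", "virginica"]})

# Values that should be treated as missing (hypothetical list for this example)
placeholders = ["NULL", ""]
df["species"] = df["species"].replace(placeholders, np.nan)

n_missing = int(df["species"].isna().sum())
print(df)
```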

### Missing Value Imputations

#### Mean Value Imputations

In [15]:
# Let's introduce missing values in a continuous variable
data = sn.load_dataset("iris")
nanidx = data.sample(frac=0.5).index
data.loc[nanidx,'sepal_length'] = np.nan
dp = DataPreprocessing(data)
da = DescriptiveAnalysis(data)
da.check_missing('sepal_length')

Total missing 75
Total complete cases 75


In [16]:
data.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,,3.6,1.4,0.2,setosa
5,,3.9,1.7,0.4,setosa
6,,3.4,1.4,0.3,setosa
7,,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


In [17]:
newdf = dp.impute_missing("sepal_length",method="mean")
da = DescriptiveAnalysis(newdf)
da.check_missing('sepal_length')

Total missing 0
Total complete cases 150


As we can see, the missing values in the column have been replaced by the mean of the observed values, so no missing cases remain.

In [18]:
newdf.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.781333,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.781333,3.6,1.4,0.2,setosa
5,5.781333,3.9,1.7,0.4,setosa
6,5.781333,3.4,1.4,0.3,setosa
7,5.781333,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


#### Median Imputation

In [19]:
# Let's introduce missing values in a continuous variable
data = sn.load_dataset("iris")
nanidx = data.sample(frac=0.5).index
data.loc[nanidx,'sepal_length'] = np.nan
dp = DataPreprocessing(data)
newdf = dp.impute_missing("sepal_length",method="median")
newdf.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.7,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,5.7,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


#### Most Frequent Imputation

In [20]:
# Let's introduce missing values in a categorical variable
data = sn.load_dataset("iris")
nanidx = data.sample(frac=0.5).index
data.loc[nanidx,'species'] = np.nan
dp = DataPreprocessing(data)
data.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,
8,4.4,2.9,1.4,0.2,
9,4.9,3.1,1.5,0.1,setosa


In [21]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
# fit_transform returns a 2-D array; flatten it before assigning to a single column
data['species'] = imp.fit_transform(data[['species']]).ravel()
data.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


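We can see that the missing values have been imputed with the most frequent value. Note that this cell uses scikit-learn's SimpleImputer directly rather than the DataPreprocessing object.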
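
The same most-frequent fill can be sketched without scikit-learn by using the column's mode. This is an alternative illustration on toy data, not what the cell above runs.

```python
import numpy as np
import pandas as pd

s = pd.Series(["setosa", np.nan, "virginica", "setosa", np.nan], dtype="object")

# mode() ignores NaN by default; take the first entry in case of ties
most_frequent = s.mode().iloc[0]
filled = s.fillna(most_frequent)

print(filled.tolist())
```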

#### KNN Imputation

KNN imputation fills each missing value using the k nearest neighbours of the incomplete row, that is, the rows most similar to it on the other variables.

In [22]:
# Let's introduce missing values in a continuous variable
# This method uses the fancyimpute package, so be sure to install it before running this cell
data = sn.load_dataset("iris")
nanidx = data.sample(frac=0.5).index
data.loc[nanidx,'sepal_length'] = np.nan
dp = DataPreprocessing(data)
newdf = dp.impute_missing(['sepal_length','petal_length'],method='knn',n=5)


Using TensorFlow backend.


Imputing row 1/150 with 1 missing, elapsed time: 0.008
Imputing row 101/150 with 0 missing, elapsed time: 0.012


## Encoding for Categorical Variables

Let's now see how to use the DataPreprocessing class for encoding categorical variables.
The species column is our categorical variable.


#### Label Encoding

In [23]:
data = sn.load_dataset("iris")
dp = DataPreprocessing(data)
newdf = dp.dummy_categ("species")
newdf.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_cat
0,5.1,3.5,1.4,0.2,setosa,0
1,4.9,3.0,1.4,0.2,setosa,0
2,4.7,3.2,1.3,0.2,setosa,0
3,4.6,3.1,1.5,0.2,setosa,0
4,5.0,3.6,1.4,0.2,setosa,0
5,5.4,3.9,1.7,0.4,setosa,0
6,4.6,3.4,1.4,0.3,setosa,0
7,5.0,3.4,1.5,0.2,setosa,0
8,4.4,2.9,1.4,0.2,setosa,0
9,4.9,3.1,1.5,0.1,setosa,0


The species column is encoded as 0, 1, 2, and so on in the new species_cat column. This is label encoding.
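
Label encoding can be sketched in pandas through the category dtype. This assumes alphabetical ordering of the categories, which is pandas' default; whether dummy_categ orders labels the same way is not confirmed by the source.

```python
import pandas as pd

df = pd.DataFrame({"species": ["setosa", "virginica", "versicolor", "setosa"]})

# Categories are sorted alphabetically, then mapped to integer codes 0, 1, 2
df["species_cat"] = df["species"].astype("category").cat.codes

print(df)
```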

#### Binary Encoding

Binary encoding is a 0/1 encoding for cases where we want to contrast one value against all the others.

In [24]:
# Let's encode setosa as 1 and the remaining species as 0
data = sn.load_dataset("iris")
dp = DataPreprocessing(data)
newdf = dp.binary_encoding("species","setosa")
newdf.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,1
1,4.9,3.0,1.4,0.2,1
2,4.7,3.2,1.3,0.2,1
3,4.6,3.1,1.5,0.2,1
4,5.0,3.6,1.4,0.2,1
5,5.4,3.9,1.7,0.4,1
6,4.6,3.4,1.4,0.3,1
7,5.0,3.4,1.5,0.2,1
8,4.4,2.9,1.4,0.2,1
9,4.9,3.1,1.5,0.1,1


#### One-hot encoding

In [25]:
# This encodes each value as a separate column on a 0/1 basis
# Note that this creates a dummy column for every value in the column; if you want to drop one column
# (for example, when doing regression), use exclude_one=True

data = sn.load_dataset("iris")
dp = DataPreprocessing(data)
newdf = dp.one_hot_encoder("species")
newdf.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_setosa,species_versicolor,species_virginica
0,5.1,3.5,1.4,0.2,setosa,1,0,0
1,4.9,3.0,1.4,0.2,setosa,1,0,0
2,4.7,3.2,1.3,0.2,setosa,1,0,0
3,4.6,3.1,1.5,0.2,setosa,1,0,0
4,5.0,3.6,1.4,0.2,setosa,1,0,0
5,5.4,3.9,1.7,0.4,setosa,1,0,0
6,4.6,3.4,1.4,0.3,setosa,1,0,0
7,5.0,3.4,1.5,0.2,setosa,1,0,0
8,4.4,2.9,1.4,0.2,setosa,1,0,0
9,4.9,3.1,1.5,0.1,setosa,1,0,0


In [26]:
data = sn.load_dataset("iris")
dp = DataPreprocessing(data)
newdf = dp.one_hot_encoder("species",exclude_one=True)
newdf.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_setosa,species_versicolor
0,5.1,3.5,1.4,0.2,setosa,1,0
1,4.9,3.0,1.4,0.2,setosa,1,0
2,4.7,3.2,1.3,0.2,setosa,1,0
3,4.6,3.1,1.5,0.2,setosa,1,0
4,5.0,3.6,1.4,0.2,setosa,1,0
5,5.4,3.9,1.7,0.4,setosa,1,0
6,4.6,3.4,1.4,0.3,setosa,1,0
7,5.0,3.4,1.5,0.2,setosa,1,0
8,4.4,2.9,1.4,0.2,setosa,1,0
9,4.9,3.1,1.5,0.1,setosa,1,0


## Other Miscellaneous Preprocessing

Let's look at other commonly used preprocessing steps and how the DataPreprocessing class supports them.

#### Removing rows with a value

In [27]:
data = sn.load_dataset("iris")
dp = DataPreprocessing(data)
newdf = dp.remove_val("species","setosa")
da = DescriptiveAnalysis(newdf)
da.descriptive_stats_categ("species")

species
versicolor    50
virginica     50
Name: species, dtype: int64


#### Centering

In [28]:
# Centering subtracts the column mean so that the values are centered around zero
newdf = dp.centering("sepal_length")
newdf.head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_length_centered
50,7.0,3.2,4.7,1.4,versicolor,0.738
51,6.4,3.2,4.5,1.5,versicolor,0.138
52,6.9,3.1,4.9,1.5,versicolor,0.638
53,5.5,2.3,4.0,1.3,versicolor,-0.762
54,6.5,2.8,4.6,1.5,versicolor,0.238
55,5.7,2.8,4.5,1.3,versicolor,-0.562
56,6.3,3.3,4.7,1.6,versicolor,0.038
57,4.9,2.4,3.3,1.0,versicolor,-1.362
58,6.6,2.9,4.6,1.3,versicolor,0.338
59,5.2,2.7,3.9,1.4,versicolor,-1.062


#### Scaling

In [29]:
newdf = dp.standard_scaling("sepal_length")
newdf.head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_length_centered,sepal_length_scaled
50,7.0,3.2,4.7,1.4,versicolor,0.738,1.119009
51,6.4,3.2,4.5,1.5,versicolor,0.138,0.209246
52,6.9,3.1,4.9,1.5,versicolor,0.638,0.967382
53,5.5,2.3,4.0,1.3,versicolor,-0.762,-1.1554
54,6.5,2.8,4.6,1.5,versicolor,0.238,0.360873
55,5.7,2.8,4.5,1.3,versicolor,-0.562,-0.852145
56,6.3,3.3,4.7,1.6,versicolor,0.038,0.057618
57,4.9,2.4,3.3,1.0,versicolor,-1.362,-2.065164
58,6.6,2.9,4.6,1.3,versicolor,0.338,0.5125
59,5.2,2.7,3.9,1.4,versicolor,-1.062,-1.610282


#### Interaction Variables

In [31]:
data = sn.load_dataset("iris")
dp = DataPreprocessing(data)
newdf = dp.one_hot_encoder("species")
newdf = dp.interaction_var_mult("sepal_length","species_setosa")
newdf.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_setosa,species_versicolor,species_virginica,sepal_length*species_setosa
0,5.1,3.5,1.4,0.2,setosa,1,0,0,5.1
1,4.9,3.0,1.4,0.2,setosa,1,0,0,4.9
2,4.7,3.2,1.3,0.2,setosa,1,0,0,4.7
3,4.6,3.1,1.5,0.2,setosa,1,0,0,4.6
4,5.0,3.6,1.4,0.2,setosa,1,0,0,5.0
5,5.4,3.9,1.7,0.4,setosa,1,0,0,5.4
6,4.6,3.4,1.4,0.3,setosa,1,0,0,4.6
7,5.0,3.4,1.5,0.2,setosa,1,0,0,5.0
8,4.4,2.9,1.4,0.2,setosa,1,0,0,4.4
9,4.9,3.1,1.5,0.1,setosa,1,0,0,4.9


#### Non-Linear Transformations

We often need non-linear transformations of a variable. The cell below shows how to get some of the commonly used transforms.

In [32]:
data = sn.load_dataset("iris")
dp = DataPreprocessing(data)
newdf = dp.non_linear_transformations('sepal_length',transform="square")
newdf = dp.non_linear_transformations('sepal_length',transform="cube")
newdf = dp.non_linear_transformations('sepal_length',transform="square_root")
newdf = dp.non_linear_transformations('sepal_length',transform="cube_root")
newdf = dp.non_linear_transformations('sepal_length',transform="ln")
newdf = dp.non_linear_transformations('sepal_length',transform="log")
newdf.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_length_sq,sepal_length_cube,sepal_length_ln,sepal_length_log
0,5.1,3.5,1.4,0.2,setosa,26.01,132.651,1.629241,0.70757
1,4.9,3.0,1.4,0.2,setosa,24.01,117.649,1.589235,0.690196
2,4.7,3.2,1.3,0.2,setosa,22.09,103.823,1.547563,0.672098
3,4.6,3.1,1.5,0.2,setosa,21.16,97.336,1.526056,0.662758
4,5.0,3.6,1.4,0.2,setosa,25.0,125.0,1.609438,0.69897
5,5.4,3.9,1.7,0.4,setosa,29.16,157.464,1.686399,0.732394
6,4.6,3.4,1.4,0.3,setosa,21.16,97.336,1.526056,0.662758
7,5.0,3.4,1.5,0.2,setosa,25.0,125.0,1.609438,0.69897
8,4.4,2.9,1.4,0.2,setosa,19.36,85.184,1.481605,0.643453
9,4.9,3.1,1.5,0.1,setosa,24.01,117.649,1.589235,0.690196
