# Chapter 7 - Data Cleaning and Preparation

## 7.2 Data Transformation (2)

In [1]:
import re

import pandas as pd
import numpy as np

- Discretisation and Binning
- Detecting and Filtering Outliers
- Permutation and Random Sampling
- Computing Indicator / Dummy Variables
<hr>

In [2]:
df = pd.read_csv('dataset-H3-videos.csv')
display(df.head())
display(df.info())

Unnamed: 0,video_id,views
0,XAzqBDFs418,1375421
1,oRSVrtKph_k,1007920
2,aFuA50H9uek,3643003
3,GhHBfDK4lE8,248880
4,CPjWgk0UXps,1405034


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 2 columns):
video_id    50 non-null object
views       50 non-null int64
dtypes: int64(1), object(1)
memory usage: 880.0+ bytes


None

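To separate the dataset into different bins for analysis, use `pd.cut()` and specify the number of `bins`. Given an integer, `pd.cut()` separates the range of the data into bins of equal width (not equal counts), extending the range slightly so that the minimum value is included.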

In [3]:
pd.cut(df['views'], bins=3)[:20]

0        (14905.822, 6213839.667]
1        (14905.822, 6213839.667]
2        (14905.822, 6213839.667]
3        (14905.822, 6213839.667]
4        (14905.822, 6213839.667]
5        (14905.822, 6213839.667]
6        (14905.822, 6213839.667]
7        (14905.822, 6213839.667]
8        (14905.822, 6213839.667]
9        (14905.822, 6213839.667]
10       (14905.822, 6213839.667]
11       (14905.822, 6213839.667]
12     (12394232.333, 18574625.0]
13       (14905.822, 6213839.667]
14    (6213839.667, 12394232.333]
15       (14905.822, 6213839.667]
16       (14905.822, 6213839.667]
17       (14905.822, 6213839.667]
18       (14905.822, 6213839.667]
19       (14905.822, 6213839.667]
Name: views, dtype: category
Categories (3, interval[float64]): [(14905.822, 6213839.667] < (6213839.667, 12394232.333] < (12394232.333, 18574625.0]]

Consistent with mathematical notation for intervals, a parenthesis `(` means that the side is open (exclusive, does not contain this value), while the square bracket `]` means it is closed (inclusive, contains this value).
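
As a small illustration of the open and closed sides, the sketch below bins a hypothetical toy `Series` (not the chapter's dataset) twice: once with the default right-closed intervals and once with `right=False`, which makes the intervals left-closed instead.

```python
import pandas as pd

# Hypothetical toy data, not taken from the chapter's dataset
values = pd.Series([1, 5, 10, 20])

# Default (right=True): intervals look like (a, b], so 10 falls in (0, 10]
right_closed = pd.cut(values, bins=[0, 10, 20])

# right=False: intervals look like [a, b), so 10 falls in [10, 20)
# and 20 falls outside every bin (it becomes NaN)
left_closed = pd.cut(values, bins=[0, 10, 20], right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```

Note that a value lying exactly on the outermost open edge is not assigned to any bin, which is why the last value is missing in the second result.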

To replace the interval notation with a custom label for each bin, use the `labels` parameter.

In [4]:
pd.cut(df['views'], bins=3, labels=['Class-1', 'Class-2', 'Class-3'])[:20]

0     Class-1
1     Class-1
2     Class-1
3     Class-1
4     Class-1
5     Class-1
6     Class-1
7     Class-1
8     Class-1
9     Class-1
10    Class-1
11    Class-1
12    Class-3
13    Class-1
14    Class-2
15    Class-1
16    Class-1
17    Class-1
18    Class-1
19    Class-1
Name: views, dtype: category
Categories (3, object): [Class-1 < Class-2 < Class-3]
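
A related option is `labels=False`, which returns the integer position of each bin instead of a label. The sketch below uses hypothetical toy values, not the chapter's dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the chapter's dataset
scores = pd.Series([3, 48, 97, 51])

# labels=False returns the bin position (0, 1, 2, ...) instead of an interval or label
codes = pd.cut(scores, bins=3, labels=False)

print(codes.tolist())  # [0, 1, 2, 1]
```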

To specify the edges of each bin explicitly, pass a `list` of bin edges to the `bins` parameter.

In [5]:
# Use float('inf') as the last edge so that the final bin captures all values above 10000000
bins = [0, 1000000, 2000000, 5000000, 10000000, float('inf')]
pd.cut(df['views'], bins=bins)[:20]

0      (1000000.0, 2000000.0]
1      (1000000.0, 2000000.0]
2      (2000000.0, 5000000.0]
3            (0.0, 1000000.0]
4      (1000000.0, 2000000.0]
5      (1000000.0, 2000000.0]
6      (1000000.0, 2000000.0]
7      (1000000.0, 2000000.0]
8            (0.0, 1000000.0]
9            (0.0, 1000000.0]
10           (0.0, 1000000.0]
11     (1000000.0, 2000000.0]
12          (10000000.0, inf]
13     (1000000.0, 2000000.0]
14    (5000000.0, 10000000.0]
15     (2000000.0, 5000000.0]
16     (1000000.0, 2000000.0]
17           (0.0, 1000000.0]
18           (0.0, 1000000.0]
19           (0.0, 1000000.0]
Name: views, dtype: category
Categories (5, interval[float64]): [(0.0, 1000000.0] < (1000000.0, 2000000.0] < (2000000.0, 5000000.0] < (5000000.0, 10000000.0] < (10000000.0, inf]]

To control the precision at which the bin edges are stored and displayed, use the `precision` parameter, for example `precision=2`. This is mainly useful when the data contains floats.
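
A minimal sketch of the `precision` parameter, using hypothetical float values rather than the chapter's dataset. It compares the default (`precision=3`) with `precision=2`.

```python
import pandas as pd

# Hypothetical float data, not taken from the chapter's dataset
data = pd.Series([0.1234, 0.5678, 0.9012, 0.3456])

default_bins = pd.cut(data, bins=4)              # precision defaults to 3
coarse_bins = pd.cut(data, bins=4, precision=2)  # coarser rounding of the edges

print(default_bins.cat.categories)
print(coarse_bins.cat.categories)
```

For these values, the edges in the second result are rounded to two decimal places.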

A closely related function is `pd.qcut()`, which bins the data by sample quantiles instead of by value range. Pass the number of quantiles as the second argument: `4` for quartiles, `10` for deciles and `100` for percentiles. Each resulting bin (usually) contains the same number of observations; tied values or a sample size that does not divide evenly can make the counts differ slightly.

In [6]:
c = pd.qcut(df['views'], 5)
display(c[:10])

0     (1142321.6, 1680947.6]
1      (563492.8, 1142321.6]
2    (1680947.6, 18574625.0]
3       (246353.0, 563492.8]
4     (1142321.6, 1680947.6]
5     (1142321.6, 1680947.6]
6      (563492.8, 1142321.6]
7      (563492.8, 1142321.6]
8      (563492.8, 1142321.6]
9       (246353.0, 563492.8]
Name: views, dtype: category
Categories (5, interval[float64]): [(33446.999, 246353.0] < (246353.0, 563492.8] < (563492.8, 1142321.6] < (1142321.6, 1680947.6] < (1680947.6, 18574625.0]]

Recall that `value_counts()` will give the unique categories and their associated number of counts.

In [7]:
display(c.value_counts())

(1680947.6, 18574625.0]    10
(1142321.6, 1680947.6]     10
(563492.8, 1142321.6]      10
(246353.0, 563492.8]       10
(33446.999, 246353.0]      10
Name: views, dtype: int64
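
To make the difference between `pd.cut()` and `pd.qcut()` concrete, the sketch below bins the same skewed toy data both ways. The data are hypothetical and not taken from the chapter's dataset.

```python
import pandas as pd

# Hypothetical skewed data: one very large value
skewed = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width bins: almost everything lands in the first bin
equal_width = pd.cut(skewed, bins=2).value_counts().sort_index()

# Equal-count bins: the edges move so that each bin holds half of the data
equal_count = pd.qcut(skewed, q=2).value_counts().sort_index()

print(equal_width.tolist())  # [7, 1]
print(equal_count.tolist())  # [4, 4]
```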

<hr>

There are two common ways to detect outliers: the $z$-score method and the inter-quartile range (IQR) method. Both reduce to filtering the data with a boolean condition.

For the inter-quartile range method, first calculate $p25$ and $p75$, the first and third quartiles. Then, letting $IQR=p75-p25$, flag all values that are smaller than $p25-1.5\times IQR$ or greater than $p75+1.5\times IQR$.

In [8]:
views = df['views'].copy()
# Use np.percentile to get the first and third quartiles
p25, p75 = np.percentile(views, [25, 75])
print(p25, p75)
# Calculate the IQR (p75 - p25) and the bounds [p25 - 1.5*IQR, p75 + 1.5*IQR].
# View counts cannot be negative, so the lower bound is clamped at 0.
iqr = p75 - p25
lower_bound, upper_bound = max(0, p25 - 1.5*iqr), p75 + 1.5*iqr
print(lower_bound, upper_bound)
# With the lower bound at 0, only values above the upper bound can be outliers
display(views[views > upper_bound])

277719.5 1558396.25
0 3479411.375


2      3643003
12    16374553
14     9816365
27     3652424
29    18574625
Name: views, dtype: int64
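
The cell above only checks the upper bound, because the lower bound is clamped at 0. A more general sketch that checks both sides might look like the following. The helper name `iqr_outliers`, its parameter `k` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values of s lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (s < q1 - k * iqr) | (s > q3 + k * iqr)
    return s[mask]

# Hypothetical toy data with one low and one high outlier
sample = pd.Series([10, 12, 11, 13, 12, 11, 95, -40])
outliers = iqr_outliers(sample)
print(outliers)
```

Unlike the cell above, this helper does not clamp the lower bound, so it also suits data that can be negative.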

For the $z$-score method, calculate the $z$-score of each observation $x_i$ using $z_i=\frac{x_i-\mu}{\sigma}$. Here $\mu$ is the sample mean, calculated using `Series.mean()`, and $\sigma$ is the sample standard deviation, calculated using `Series.std()`. Then, filter for all observations whose $z$-score is greater than $3$ or less than $-3$. This can be represented using the inequality $|z|>3$, where $|z|$ is the absolute value of $z$.

In [9]:
views = df['views'].copy()
# Calculate the z-score using the mean and SD of the values
z_scores = (views - views.mean()) / views.std()
# Note: scipy.stats has a simple function to calculate z-score called zscore()

# Filter for all values whose absolute z-score is greater than 3
views[z_scores.abs() > 3]

12    16374553
29    18574625
Name: views, dtype: int64
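
One detail worth checking when comparing results across libraries is the degrees-of-freedom setting. `Series.std()` uses `ddof=1` (sample standard deviation) by default, whereas `scipy.stats.zscore()` uses `ddof=0` unless told otherwise, so the two can give slightly different $z$-scores. The sketch below shows the difference using pandas only, on hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the chapter's dataset
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

z_sample = (x - x.mean()) / x.std()            # pandas default: ddof=1
z_population = (x - x.mean()) / x.std(ddof=0)  # population SD: ddof=0

# The sample SD is slightly larger, so its z-scores are slightly smaller in magnitude
print(x.std(), x.std(ddof=0))
print(z_sample.abs().max(), z_population.abs().max())
```

For large samples the difference is negligible, but with only a few dozen observations it can move a borderline value across the $|z|>3$ threshold.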

<hr>
To randomly sample rows from a dataset, use `df.sample(n)`, where `n` is the number of rows to draw. To obtain one random permutation of the dataset, use `df.sample(df.shape[0])`, where `df.shape[0]` gives the number of records in the dataset.

To sample with replacement, pass `replace=True`, for example `df.sample(n, replace=True)`.

In [10]:
df2 = pd.read_csv('dataset-A-loans.csv', index_col=0)
display(df2)
display(df2.sample(3))
display(df2.sample(df2.shape[0]))

Unnamed: 0,loan_amnt,int_rate,term,grade
48304290,30000.0,8.18,36 months,B
49904421,14225.0,13.33,60 months,C
32038416,12000.0,20.2,60 months,E
11456303,18000.0,8.39,36 months,A
23613274,4000.0,12.49,36 months,B
55949701,15000.0,16.99,60 months,D


Unnamed: 0,loan_amnt,int_rate,term,grade
11456303,18000.0,8.39,36 months,A
48304290,30000.0,8.18,36 months,B
55949701,15000.0,16.99,60 months,D


Unnamed: 0,loan_amnt,int_rate,term,grade
32038416,12000.0,20.2,60 months,E
48304290,30000.0,8.18,36 months,B
49904421,14225.0,13.33,60 months,C
55949701,15000.0,16.99,60 months,D
11456303,18000.0,8.39,36 months,A
23613274,4000.0,12.49,36 months,B
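
The sketch below collects a few common `sample()` variations: `frac=1` as another way to permute all rows, `replace=True` to draw more rows than the dataset contains, and `random_state` to make the draw reproducible. The small `loans` frame is hypothetical and only loosely modelled on the table above.

```python
import pandas as pd

# Hypothetical toy frame, loosely modelled on the loans table
loans = pd.DataFrame({
    "loan_amnt": [30000.0, 14225.0, 12000.0, 18000.0],
    "grade": ["B", "C", "E", "A"],
})

# frac=1 returns every row in random order (a permutation)
shuffled = loans.sample(frac=1, random_state=42)

# With replacement, n may exceed the number of rows
bootstrap = loans.sample(n=6, replace=True, random_state=42)

print(shuffled)
print(bootstrap)
```

Without `replace=True`, asking for more rows than the dataset holds raises a `ValueError`.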


<hr>
To get dummy (indicator) variables from a categorical column, use `pd.get_dummies()`. To add a prefix to the generated column names, use the `prefix` parameter.

In [11]:
display(df2['grade'])
display(pd.get_dummies(df2['grade']))
display(pd.get_dummies(df2['grade'], prefix='grade'))

48304290    B
49904421    C
32038416    E
11456303    A
23613274    B
55949701    D
Name: grade, dtype: object

Unnamed: 0,A,B,C,D,E
48304290,0,1,0,0,0
49904421,0,0,1,0,0
32038416,0,0,0,0,1
11456303,1,0,0,0,0
23613274,0,1,0,0,0
55949701,0,0,0,1,0


Unnamed: 0,grade_A,grade_B,grade_C,grade_D,grade_E
48304290,0,1,0,0,0
49904421,0,0,1,0,0
32038416,0,0,0,0,1
11456303,1,0,0,0,0
23613274,0,1,0,0,0
55949701,0,0,0,1,0
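
Two optional parameters are often useful here. In pandas 2.0 and later, `pd.get_dummies()` returns boolean columns by default, so passing `dtype=int` reproduces the `0`/`1` output shown above. `drop_first=True` drops the first category, which avoids perfectly collinear columns in regression models. The sketch rebuilds the `grade` values by hand, so the variable names are illustrative only.

```python
import pandas as pd

# Grade values copied from the table above; the Series itself is rebuilt by hand
grades = pd.Series(["B", "C", "E", "A", "B", "D"], name="grade")

# dtype=int gives 0/1 columns regardless of the pandas version's default
indicators = pd.get_dummies(grades, prefix="grade", dtype=int)

# drop_first=True removes the first category (grade_A here)
reduced = pd.get_dummies(grades, prefix="grade", drop_first=True, dtype=int)

print(indicators)
print(reduced)
```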


In more complex cases, for example when one row can belong to several categories at once, some data wrangling is needed before the indicator variables can be built.

In [12]:
df3 = pd.read_csv('dataset-H4-videos.csv', sep="#")
df3

Unnamed: 0,video_id,title,tags
0,Ph54wQG8ynk,Camila Cabello - Never Be the Same,"camila cabello|""camila""|""camila full album""|""h..."
1,Ph54wQG8ynk,Camila Cabello - Never Be the Same,"camila cabello|""camila""|""camila full album""|""h..."
2,bg7RjxsghNY,Camila Cabello - Real Friends (Audio),"camila cabello|""real friends""|""camila""|""camili..."
3,qooQd8AA7_M,"Camila Cabello, Daddy Yankee - Havana (Remix -...","camila cabello|""camila""|""daddy yankee""|""havana..."
4,Ph54wQG8ynk,Camila Cabello - Never Be the Same,"camila cabello|""camila""|""camila full album""|""h..."
5,Ph54wQG8ynk,Camila Cabello - Never Be the Same,"camila cabello|""camila""|""camila full album""|""h..."
6,Ph54wQG8ynk,Camila Cabello - Never Be the Same,"camila cabello|""camila""|""camila full album""|""h..."
7,Ph54wQG8ynk,Camila Cabello - Never Be the Same,"camila cabello|""camila""|""camila full album""|""h..."
8,Ph54wQG8ynk,Camila Cabello - Never Be the Same,"camila cabello|""camila""|""camila full album""|""h..."
9,bg7RjxsghNY,Camila Cabello - Real Friends (Audio),"camila cabello|""real friends""|""camila""|""camili..."


In [13]:
all_tags = []

# iterate through all the tags, perform some data wrangling and obtain
# the list of all unique tags
for t in df3['tags'].tolist():
    # Split to get the unique tags
    t_tags = (t.split('|'))
    # Transform all words to lower case and remove all special characters
    t_tags = [s.lower() for s in t_tags]
    # Use a raw string so that \s is not treated as an invalid string escape
    t_tags = [re.sub(r'[^\sa-z]', '', s) for s in t_tags]
    all_tags.extend(t_tags)
all_tags = pd.unique(all_tags)
display(all_tags)

array(['camila cabello', 'camila', 'camila full album', 'havana',
       'never be the same', 'all these years', 'she loves control',
       'inside out', 'consequences', 'somethings gotta give',
       'in the dark', 'into it', 'crying in the club', 'i have questions',
       'fifth harmony', 'camilizers', 'harmonizers', 'pop',
       'syco musicepic', 'real friends', 'omg', 'know no better', 'h',
       'young thug', 'havana feat young thug', 'daddy yankee', 'lele',
       'pons', 'lelepons', 'shawn mendes', 'lele pons', 'lujuan james',
       'influencers', 'instagram influencer', 'lele instagram',
       'latina influencer', 'lelepons dance videos',
       'lele pons dance battle', 'camila cabello  daddy yankee', 'latin'],
      dtype=object)

In [14]:
pd.options.display.max_columns = None

Next, create a matrix of zeros with one row per record and one column per unique tag. Then, for every tag encountered in a row, change the corresponding cell from `0` to `1`.

In [15]:
zeros_matrix = np.zeros((df3.shape[0], len(all_tags)))
dummies_df = pd.DataFrame(zeros_matrix, columns=all_tags)
dummies_df = dummies_df.astype(int)

# Iterate through every row.
for i, v in df3.iterrows():
    tags = v['tags']
    # Get all tags (after cleaning)
    t_tags = tags.split('|')
    t_tags = [s.lower() for s in t_tags]
    t_tags = [re.sub(r'[^\sa-z]', '', s) for s in t_tags]
    # For all tags, set the cell in the df to 1
    for t in t_tags:
        dummies_df.loc[i, t] = 1
    print(t_tags)
    display(dummies_df.iloc[i:i+1])

['camila cabello', 'camila', 'camila full album', 'havana', 'never be the same', 'all these years', 'she loves control', 'inside out', 'consequences', 'somethings gotta give', 'in the dark', 'into it', 'crying in the club', 'i have questions', 'fifth harmony', 'camilizers', 'harmonizers', 'camila cabello', 'never be the same', 'pop', 'syco musicepic']


Unnamed: 0,camila cabello,camila,camila full album,havana,never be the same,all these years,she loves control,inside out,consequences,somethings gotta give,in the dark,into it,crying in the club,i have questions,fifth harmony,camilizers,harmonizers,pop,syco musicepic,real friends,omg,know no better,h,young thug,havana feat young thug,daddy yankee,lele,pons,lelepons,shawn mendes,lele pons,lujuan james,influencers,instagram influencer,lele instagram,latina influencer,lelepons dance videos,lele pons dance battle,camila cabello daddy yankee,latin
0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


['camila cabello', 'camila', 'camila full album', 'havana', 'never be the same', 'all these years', 'she loves control', 'inside out', 'consequences', 'somethings gotta give', 'in the dark', 'into it', 'crying in the club', 'i have questions', 'fifth harmony', 'camilizers', 'harmonizers', 'camila cabello', 'never be the same', 'pop', 'syco musicepic']


Unnamed: 0,camila cabello,camila,camila full album,havana,never be the same,all these years,she loves control,inside out,consequences,somethings gotta give,in the dark,into it,crying in the club,i have questions,fifth harmony,camilizers,harmonizers,pop,syco musicepic,real friends,omg,know no better,h,young thug,havana feat young thug,daddy yankee,lele,pons,lelepons,shawn mendes,lele pons,lujuan james,influencers,instagram influencer,lele instagram,latina influencer,lelepons dance videos,lele pons dance battle,camila cabello daddy yankee,latin
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


['camila cabello', 'real friends', 'camila', 'camilizers', 'fifth harmony', 'harmonizers', 'havana', 'omg', 'crying in the club', 'i have questions', 'know no better', 'h', 'never be the same', 'all these years', 'she loves control', 'young thug', 'inside out', 'consequences', 'somethings gotta give', 'in the dark', 'into it', 'havana feat young thug', 'camila cabello', 'pop', 'real friends', 'syco musicepic']


Unnamed: 0,camila cabello,camila,camila full album,havana,never be the same,all these years,she loves control,inside out,consequences,somethings gotta give,in the dark,into it,crying in the club,i have questions,fifth harmony,camilizers,harmonizers,pop,syco musicepic,real friends,omg,know no better,h,young thug,havana feat young thug,daddy yankee,lele,pons,lelepons,shawn mendes,lele pons,lujuan james,influencers,instagram influencer,lele instagram,latina influencer,lelepons dance videos,lele pons dance battle,camila cabello daddy yankee,latin
2,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


['camila cabello', 'camila', 'daddy yankee', 'havana', 'fifth harmony', 'harmonizers', 'h', 'omg', 'crying in the club', 'i have questions', 'know no better', 'lele', 'pons', 'lelepons', 'shawn mendes', 'lele pons', 'lujuan james', 'influencers', 'instagram influencer', 'lele instagram', 'latina influencer', 'lelepons dance videos', 'lele pons dance battle', 'camila cabello  daddy yankee', 'havana', 'latin', 'syco musicepic']


Unnamed: 0,camila cabello,camila,camila full album,havana,never be the same,all these years,she loves control,inside out,consequences,somethings gotta give,in the dark,into it,crying in the club,i have questions,fifth harmony,camilizers,harmonizers,pop,syco musicepic,real friends,omg,know no better,h,young thug,havana feat young thug,daddy yankee,lele,pons,lelepons,shawn mendes,lele pons,lujuan james,influencers,instagram influencer,lele instagram,latina influencer,lelepons dance videos,lele pons dance battle,camila cabello daddy yankee,latin
3,1,1,0,1,0,0,0,0,0,0,0,0,1,1,1,0,1,0,1,0,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


['camila cabello', 'camila', 'camila full album', 'havana', 'never be the same', 'all these years', 'she loves control', 'inside out', 'consequences', 'somethings gotta give', 'in the dark', 'into it', 'crying in the club', 'i have questions', 'fifth harmony', 'camilizers', 'harmonizers', 'camila cabello', 'never be the same', 'pop', 'syco musicepic']


Unnamed: 0,camila cabello,camila,camila full album,havana,never be the same,all these years,she loves control,inside out,consequences,somethings gotta give,in the dark,into it,crying in the club,i have questions,fifth harmony,camilizers,harmonizers,pop,syco musicepic,real friends,omg,know no better,h,young thug,havana feat young thug,daddy yankee,lele,pons,lelepons,shawn mendes,lele pons,lujuan james,influencers,instagram influencer,lele instagram,latina influencer,lelepons dance videos,lele pons dance battle,camila cabello daddy yankee,latin
4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


['camila cabello', 'camila', 'camila full album', 'havana', 'never be the same', 'all these years', 'she loves control', 'inside out', 'consequences', 'somethings gotta give', 'in the dark', 'into it', 'crying in the club', 'i have questions', 'fifth harmony', 'camilizers', 'harmonizers', 'camila cabello', 'never be the same', 'pop', 'syco musicepic']


Unnamed: 0,camila cabello,camila,camila full album,havana,never be the same,all these years,she loves control,inside out,consequences,somethings gotta give,in the dark,into it,crying in the club,i have questions,fifth harmony,camilizers,harmonizers,pop,syco musicepic,real friends,omg,know no better,h,young thug,havana feat young thug,daddy yankee,lele,pons,lelepons,shawn mendes,lele pons,lujuan james,influencers,instagram influencer,lele instagram,latina influencer,lelepons dance videos,lele pons dance battle,camila cabello daddy yankee,latin
5,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


['camila cabello', 'camila', 'camila full album', 'havana', 'never be the same', 'all these years', 'she loves control', 'inside out', 'consequences', 'somethings gotta give', 'in the dark', 'into it', 'crying in the club', 'i have questions', 'fifth harmony', 'camilizers', 'harmonizers', 'camila cabello', 'never be the same', 'pop', 'syco musicepic']


Unnamed: 0,camila cabello,camila,camila full album,havana,never be the same,all these years,she loves control,inside out,consequences,somethings gotta give,in the dark,into it,crying in the club,i have questions,fifth harmony,camilizers,harmonizers,pop,syco musicepic,real friends,omg,know no better,h,young thug,havana feat young thug,daddy yankee,lele,pons,lelepons,shawn mendes,lele pons,lujuan james,influencers,instagram influencer,lele instagram,latina influencer,lelepons dance videos,lele pons dance battle,camila cabello daddy yankee,latin
6,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


['camila cabello', 'camila', 'camila full album', 'havana', 'never be the same', 'all these years', 'she loves control', 'inside out', 'consequences', 'somethings gotta give', 'in the dark', 'into it', 'crying in the club', 'i have questions', 'fifth harmony', 'camilizers', 'harmonizers', 'camila cabello', 'never be the same', 'pop', 'syco musicepic']


Unnamed: 0,camila cabello,camila,camila full album,havana,never be the same,all these years,she loves control,inside out,consequences,somethings gotta give,in the dark,into it,crying in the club,i have questions,fifth harmony,camilizers,harmonizers,pop,syco musicepic,real friends,omg,know no better,h,young thug,havana feat young thug,daddy yankee,lele,pons,lelepons,shawn mendes,lele pons,lujuan james,influencers,instagram influencer,lele instagram,latina influencer,lelepons dance videos,lele pons dance battle,camila cabello daddy yankee,latin
7,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


['camila cabello', 'camila', 'camila full album', 'havana', 'never be the same', 'all these years', 'she loves control', 'inside out', 'consequences', 'somethings gotta give', 'in the dark', 'into it', 'crying in the club', 'i have questions', 'fifth harmony', 'camilizers', 'harmonizers', 'camila cabello', 'never be the same', 'pop', 'syco musicepic']


Unnamed: 0,camila cabello,camila,camila full album,havana,never be the same,all these years,she loves control,inside out,consequences,somethings gotta give,in the dark,into it,crying in the club,i have questions,fifth harmony,camilizers,harmonizers,pop,syco musicepic,real friends,omg,know no better,h,young thug,havana feat young thug,daddy yankee,lele,pons,lelepons,shawn mendes,lele pons,lujuan james,influencers,instagram influencer,lele instagram,latina influencer,lelepons dance videos,lele pons dance battle,camila cabello daddy yankee,latin
8,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


['camila cabello', 'real friends', 'camila', 'camilizers', 'fifth harmony', 'harmonizers', 'havana', 'omg', 'crying in the club', 'i have questions', 'know no better', 'h', 'never be the same', 'all these years', 'she loves control', 'young thug', 'inside out', 'consequences', 'somethings gotta give', 'in the dark', 'into it', 'havana feat young thug', 'camila cabello', 'pop', 'real friends', 'syco musicepic']


Unnamed: 0,camila cabello,camila,camila full album,havana,never be the same,all these years,she loves control,inside out,consequences,somethings gotta give,in the dark,into it,crying in the club,i have questions,fifth harmony,camilizers,harmonizers,pop,syco musicepic,real friends,omg,know no better,h,young thug,havana feat young thug,daddy yankee,lele,pons,lelepons,shawn mendes,lele pons,lujuan james,influencers,instagram influencer,lele instagram,latina influencer,lelepons dance videos,lele pons dance battle,camila cabello daddy yankee,latin
9,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [16]:
display(dummies_df)

Unnamed: 0,camila cabello,camila,camila full album,havana,never be the same,all these years,she loves control,inside out,consequences,somethings gotta give,in the dark,into it,crying in the club,i have questions,fifth harmony,camilizers,harmonizers,pop,syco musicepic,real friends,omg,know no better,h,young thug,havana feat young thug,daddy yankee,lele,pons,lelepons,shawn mendes,lele pons,lujuan james,influencers,instagram influencer,lele instagram,latina influencer,lelepons dance videos,lele pons dance battle,camila cabello daddy yankee,latin
0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,0,0,0,0,0,0,0,1,1,1,0,1,0,1,0,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
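
For delimiter-separated tags like these, pandas also offers `Series.str.get_dummies()`, which splits each string on a separator and builds the indicator matrix in one step. The sketch below is an alternative to the loop above, not the notebook's own method. The two tag strings are hypothetical, and the regular expression keeps the `|` separator so that the cleaning can run before the split.

```python
import pandas as pd

# Hypothetical tag strings in the same pipe-separated style as the dataset
tags = pd.Series([
    'pop|"Camila"|"Havana (Remix)"',
    'pop|"Real Friends"',
])

# Lower-case, then drop everything except letters, whitespace and the separator
cleaned = tags.str.lower().str.replace(r'[^\sa-z|]', '', regex=True)

# One indicator column per distinct tag
tag_dummies = cleaned.str.get_dummies(sep='|')
print(tag_dummies)
```

The resulting columns are sorted alphabetically, whereas the loop above keeps tags in order of first appearance.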


Indicator variables can also be created for bins. First, bin the data using `pd.cut()` or `pd.qcut()`. Then, use `pd.get_dummies()` to get one indicator variable for each bin.

In [17]:
df = pd.read_csv('dataset-H3-videos.csv')
# Bin the views into quartiles using qcut
df['views_bins'] = pd.qcut(df['views'], 4)
display(df.head(10))

# Use pd.get_dummies() to get one indicator column per bin
bins_indicators = pd.get_dummies(df['views_bins'])
# Convert the Interval column labels to strings
bins_indicators.columns = [str(i) for i in bins_indicators.columns]
display(bins_indicators.head(10))

# Finally, use df.join() to join the two DataFrames on the index
df.join(bins_indicators)

Unnamed: 0,video_id,views,views_bins
0,XAzqBDFs418,1375421,"(732104.0, 1558396.25]"
1,oRSVrtKph_k,1007920,"(732104.0, 1558396.25]"
2,aFuA50H9uek,3643003,"(1558396.25, 18574625.0]"
3,GhHBfDK4lE8,248880,"(33446.999, 277719.5]"
4,CPjWgk0UXps,1405034,"(732104.0, 1558396.25]"
5,8EK-QMtHhMI,1503192,"(732104.0, 1558396.25]"
6,a30K69hUJyo,1139752,"(732104.0, 1558396.25]"
7,dLRMA_lWsDY,1090128,"(732104.0, 1558396.25]"
8,rqTpMCq8uhk,606312,"(277719.5, 732104.0]"
9,3gTyF-wLa-E,264956,"(33446.999, 277719.5]"


Unnamed: 0,"(33446.999, 277719.5]","(277719.5, 732104.0]","(732104.0, 1558396.25]","(1558396.25, 18574625.0]"
0,0,0,1,0
1,0,0,1,0
2,0,0,0,1
3,1,0,0,0
4,0,0,1,0
5,0,0,1,0
6,0,0,1,0
7,0,0,1,0
8,0,1,0,0
9,1,0,0,0


Unnamed: 0,video_id,views,views_bins,"(33446.999, 277719.5]","(277719.5, 732104.0]","(732104.0, 1558396.25]","(1558396.25, 18574625.0]"
0,XAzqBDFs418,1375421,"(732104.0, 1558396.25]",0,0,1,0
1,oRSVrtKph_k,1007920,"(732104.0, 1558396.25]",0,0,1,0
2,aFuA50H9uek,3643003,"(1558396.25, 18574625.0]",0,0,0,1
3,GhHBfDK4lE8,248880,"(33446.999, 277719.5]",1,0,0,0
4,CPjWgk0UXps,1405034,"(732104.0, 1558396.25]",0,0,1,0
5,8EK-QMtHhMI,1503192,"(732104.0, 1558396.25]",0,0,1,0
6,a30K69hUJyo,1139752,"(732104.0, 1558396.25]",0,0,1,0
7,dLRMA_lWsDY,1090128,"(732104.0, 1558396.25]",0,0,1,0
8,rqTpMCq8uhk,606312,"(277719.5, 732104.0]",0,1,0,0
9,3gTyF-wLa-E,264956,"(33446.999, 277719.5]",1,0,0,0
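
Interval strings make awkward column names. One possible variation is to label the bins when creating them and let `prefix` build readable column names. The frame, bin edges and labels below are hypothetical and not taken from the chapter's dataset.

```python
import pandas as pd

# Hypothetical toy data, bin edges and labels
videos = pd.DataFrame({"views": [120, 950, 4300, 88000]})

videos["views_bins"] = pd.cut(
    videos["views"],
    bins=[0, 1000, 10000, float("inf")],
    labels=["low", "mid", "high"],
)

# The labels become column suffixes: views_low, views_mid, views_high
indicators = pd.get_dummies(videos["views_bins"], prefix="views", dtype=int)
result = videos.join(indicators)
print(result)
```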


**References:**

Python for Data Analysis, 2nd Edition, McKinney (2017)