# Data Cleaning and Preparation

Data preparation tasks such as loading, cleaning, transforming, and rearranging are often reported to take up 80% or more of an analyst's time. pandas, together with built-in Python features, provides a flexible and fast set of tools for manipulating data into the right form.

Datasets: https://github.com/wesm/pydata-book/tree/3rd-edition/datasets 

## Index

* [Handling Missing Data](#handling-missing-data)
    * [Filtering Out Missing Data](#filtering-out-missing-data)
    * [Filling In Missing Data](#filling-in-missing-data)
* [Data Transformation](#data-transformation)
    * [Removing Duplicates](#removing-duplicates)
    * [Transforming Data Using a Function or Mapping](#trasforming-data-using-a-function-or-mapping)
    * [Replacing Values](#replacing-values)
    * [Renaming Axis Indexes](#renaming-axis-indexes)
    * [Discretization and Binning](#discretization-and-binning)
    * [Detecting and Filtering Outliers](#detecting-and-filtering-outliers)
    * [Permutation and Random Sampling](#permutation-and-random-sampling)
    * [Computing Indicator/Dummy Variables](#computing-indicatordummy-variables)
* [Extension Data Types](#extension-data-types)
* [String Manipulation](#string-manipulation)
    * [Python Built-In String Object Methods](#python-built-in-string-object-methods)
    * [Regular Expressions](#regular-expressions)
    * [String Functions in Pandas](#string-functions-in-pandas)
* [Categorical Data](#categorical-data)
    * [Background and Motivation](#background-and-motivation)
    * [Categorical Extension Type in pandas](#categorical-extension-type-in-pandas)
    * [Computations with Categoricals](#computations-with-categoricals)
    * [Categorical Methods](#categorical-methods)
    

## Handling Missing Data

*NA handling object methods*
|Method|Description|
|---|---|
|dropna |Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.|
|fillna |Fill in missing data with some value or using an interpolation method such as "ffill" or "bfill".|
|isna |Return Boolean values indicating which values are missing/NA.|
|notna |Negation of isna, returns True for non-NA values and False for NA values|

In [6]:
import pandas as pd 
import numpy as np 

flo_data = pd.Series([1.2, -3.5, np.nan, 0])

print(f"Series \n{flo_data}")
print(f"\nMissing data: \n{flo_data.isna()}")

# Python's built-in None is also treated as NA
flo_data = pd.Series([1.2, -3.5, np.nan, 0, None])

print(f"\nSeries \n{flo_data}")
print(f"\nMissing data: \n{flo_data.isna()}")

Series 
0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

Missing data: 
0    False
1    False
2     True
3    False
dtype: bool
Series 
0    1.2
1   -3.5
2    NaN
3    0.0
4    NaN
dtype: float64

Missing data: 
0    False
1    False
2     True
3    False
4     True
dtype: bool


### Filtering Out Missing Data

In [8]:
import pandas as pd 
import numpy as np 

flo_data = pd.Series([1.2, -3.5, np.nan, 0, None])

print(f"Series without NA values: \n{flo_data.dropna()}")
# Same result with Boolean filtering
print(f"\nUsing filter flo_data[flo_data.notna()]: \n{flo_data[flo_data.notna()]}")

Series without NA values: 
0    1.2
1   -3.5
3    0.0
dtype: float64

Using filter flo_data[flo_data.notna()]: 
0    1.2
1   -3.5
3    0.0
dtype: float64


In [11]:
import pandas as pd 
import numpy as np 

frame = pd.DataFrame([[2., 5.5, 3.], 
                      [2., np.nan, np.nan], 
                      [np.nan, np.nan, np.nan], 
                      [np.nan, 5.5, 3.]])

print(f"DFrame: \n{frame}")

print(f"\nDFrame without rows containing NA values: \n{frame.dropna()}")
print(f"\nDFrame dropping full NA rows: \n{frame.dropna(how='all')}")

frame[4] = np.nan 
print(f"DFrame with new column 4 with NA values: \n{frame}")
print(f"\nDropping full NA columns specifying axis:"
      f"\n{frame.dropna(axis='columns', how='all')}")

DFrame: 
     0    1    2
0  2.0  5.5  3.0
1  2.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  5.5  3.0

DFrame without rows containing NA values: 
     0    1    2
0  2.0  5.5  3.0

DFrame dropping full NA rows: 
     0    1    2
0  2.0  5.5  3.0
1  2.0  NaN  NaN
3  NaN  5.5  3.0
DFrame with new column 4 with NA values: 
     0    1    2   4
0  2.0  5.5  3.0 NaN
1  2.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  5.5  3.0 NaN

Dropping full NA columns specifying axis:
     0    1    2
0  2.0  5.5  3.0
1  2.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  5.5  3.0


In [13]:
## Keep only rows that have at least a given number of non-NA values (thresh)
import pandas as pd 
import numpy as np 

df = pd.DataFrame(np.random.standard_normal((7,3)))
df.iloc[:4, 1] = np.nan 
df.iloc[:2, 2] = np.nan 

print(f"DataFrame: \n{df}")

print(f"\nDropping rows with NA values: \n{df.dropna()}")

print(f"\nKeeping rows with at least 2 non-NA values: \n{df.dropna(thresh=2)}")

DataFrame: 
          0         1         2
0 -0.424858       NaN       NaN
1 -0.156521       NaN       NaN
2 -0.475313       NaN  0.014500
3 -0.299724       NaN -0.754694
4 -0.068748 -0.675605  0.069436
5 -0.223925 -0.006411  2.636000
6 -1.614529  1.470182 -1.130516

Dropping rows with NA values: 
          0         1         2
4 -0.068748 -0.675605  0.069436
5 -0.223925 -0.006411  2.636000
6 -1.614529  1.470182 -1.130516

Keeping rows with at least 2 non-NA values: 
          0         1         2
2 -0.475313       NaN  0.014500
3 -0.299724       NaN -0.754694
4 -0.068748 -0.675605  0.069436
5 -0.223925 -0.006411  2.636000
6 -1.614529  1.470182 -1.130516


### Filling In Missing Data

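Filling in missing data lets us keep rows that would otherwise be discarded. The `fillna()` method is the main tool: called with a constant, it replaces every NA with that value, and called with a dictionary, it uses a different fill value for each column. Missing values can also be filled from neighboring cells in the same column. This was previously done with `fillna(method='ffill')`, optionally combined with *limit*, but the *method* argument is deprecated in recent pandas versions; use the `ffill()` or `bfill()` methods instead.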

*fillna Function arguments*
|Argument|Description|
|---|---|
|value |Scalar value or dictionary-like object to use to fill missing values|
|method |Interpolation method: one of "bfill" (backward fill) or "ffill" (forward fill); default is None. Deprecated in recent pandas versions in favor of the `ffill()` and `bfill()` methods|
|axis |Axis to fill on ("index" or "columns"); default is axis="index"|
|limit |For forward and backward filling, maximum number of consecutive periods to fill|

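The two `fillna` options that the cells below do not exercise, a per-column dictionary and backward filling, can be sketched as follows. The small DataFrame and its column names are made up for illustration, and `bfill()` is used instead of the deprecated `method` argument.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with gaps in both columns
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, 2.0, np.nan, np.nan],
})

# A dictionary gives each column its own fill value
filled_by_column = df.fillna({"a": 0.0, "b": -1.0})

# bfill() propagates the next valid observation backward;
# limit=1 fills at most one consecutive NA in each gap
back_filled = df.bfill(limit=1)

print(f"Filled per column: \n{filled_by_column}")
print(f"\nBackward filled (limit=1): \n{back_filled}")
```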
In [14]:
import pandas as pd 
import numpy as np 

df = pd.DataFrame(np.random.standard_normal((7,3)))
df.iloc[:4, 1] = np.nan 
df.iloc[:2, 2] = np.nan 

print(f"DataFrame: \n{df}")

print(f"\nDataFrame filling NA with '0': \n{df.fillna(0)}")

DataFrame: 
          0         1         2
0 -0.260455       NaN       NaN
1 -1.361895       NaN       NaN
2  0.485779       NaN -0.932451
3 -0.977748       NaN  0.855924
4  0.173131  1.399493  0.354406
5  0.039476 -0.582290 -1.913124
6 -0.291814  0.146749 -0.044685

DataFrame filling NA with '0': 
          0         1         2
0 -0.260455  0.000000  0.000000
1 -1.361895  0.000000  0.000000
2  0.485779  0.000000 -0.932451
3 -0.977748  0.000000  0.855924
4  0.173131  1.399493  0.354406
5  0.039476 -0.582290 -1.913124
6 -0.291814  0.146749 -0.044685


In [18]:
import pandas as pd 
import numpy as np 

df = pd.DataFrame(np.random.standard_normal((7,3)))
df.iloc[2:, 1] = np.nan 
df.iloc[4:, 2] = np.nan 

print(f"DataFrame: \n{df}")

#print(f"\nDataFrame filling NA with values in the same column:"
#      f"\n{df.fillna(method='ffill', limit=2)}")

print(f"\nDataFrame filling NA with values in the same column:"
      f"\n{df.ffill(limit=2)}")


DataFrame: 
          0         1         2
0  0.925019  0.332396  1.077413
1  0.360310  0.789294  0.749743
2  0.606949       NaN -0.234555
3 -0.404238       NaN  1.091170
4 -2.075037       NaN       NaN
5 -0.521436       NaN       NaN
6 -0.551778       NaN       NaN

DataFrame filling NA with values in the same column:
          0         1         2
0  0.925019  0.332396  1.077413
1  0.360310  0.789294  0.749743
2  0.606949  0.789294 -0.234555
3 -0.404238  0.789294  1.091170
4 -2.075037       NaN  1.091170
5 -0.521436       NaN  1.091170
6 -0.551778       NaN       NaN


In [19]:
## Filling data in a Series with the mean of the series
import pandas as pd 
import numpy as np 

data = pd.Series([1., np.nan, 3.5, np.nan, 7])

print(f"\nOriginal Series: \n{data}")

print(f"\nFilling NA with 'mean': \n{data.fillna(data.mean())}")



Original Series: 
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

Filling NA with 'mean': 
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64


## Data Transformation

### Removing Duplicates

The DataFrame method `duplicated()` returns a Boolean Series indicating whether each row is a duplicate (its column values are exactly equal to those in an earlier row). `drop_duplicates()` returns a DataFrame with the duplicated rows removed. By default both methods consider all columns and keep the first observed row; pass `subset` to look for duplicates in specific columns only, or `keep="last"` to keep the last observed row instead.

In [4]:
import pandas as pd 
import numpy as np 

frame = pd.DataFrame({
    "k1": ["one", "two"] * 3 + ["two"],
    "k2": [1, 1, 2, 3, 3, 4, 4]
})

print(f"DataFrame: \n{frame}")

print(f"\nDuplicated rows: \n{frame.duplicated()}")

# Returning DataFrame without duplicated rows 
print(f"\nDropping Duplicates: \n{frame.drop_duplicates()}")

# Add column 'v1' with range(7)
frame["v1"] = range(7)

# Duplicated values based only on "k1" column
print(f"\nNon-duplicated values based on 'K1' column:"
      f"\n{frame.drop_duplicates(subset=['k1'])}")

# Keeping last observed values
print(f"\nDrop duplicates keeping last rows:"
      f"\n{frame.drop_duplicates(['k1', 'k2'], keep='last')}")

DataFrame: 
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4

Duplicated rows: 
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Dropping Duplicates: 
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4

Non-duplicated values based on 'K1' column:
    k1  k2  v1
0  one   1   0
1  two   1   1

Drop duplicates keeping last rows:
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6


### Trasforming Data Using a Function or Mapping

Next, we add a column derived from the values of another column, using a dictionary as a lookup table. The Series `map()` method accepts a function or a dictionary-like object containing the mapping and applies it to each value.

In [7]:
import pandas as pd 
import numpy as np 

frame = pd.DataFrame({
    "food": ["bacon", "pulled pork", "bacon",
             "pastrami", "corned beef", "bacon",
             "pastrami", "honey ham", "nova lox"],
    "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]
})

meat_to_animal = {
    "bacon": "pig",
    "pulled pork": "pig",
    "pastrami": "cow",
    "corned beef": "cow",
    "honey ham": "pig",
    "nova lox": "salmon"
}

print(f"DataFrame: \n{frame}")
print(f"\nDictionary, meat that corresponds to which animal:"
      f"\n{meat_to_animal}")

# Creating "animal" column in 'frame'
# Then, considering 'food' column we map() the 'meat_to_animal' dictionary 
frame["animal"] = frame["food"].map(meat_to_animal)

"""
Alternative using a function:
def get_animal(x):
    return meat_to_animal[x]

frame["animal"] = frame["food"].map(get_animal)
"""

print(f"\nUpdated DataFrame: \n{frame}")

DataFrame: 
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     pastrami     6.0
4  corned beef     7.5
5        bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

Dictionary, meat that corresponds to which animal:
{'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow', 'corned beef': 'cow', 'honey ham': 'pig', 'nova lox': 'salmon'}

Updated DataFrame: 
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     pastrami     6.0     cow
4  corned beef     7.5     cow
5        bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon


### Replacing Values

Sometimes sentinel values such as -999 represent missing data. `replace()` provides a simpler and more flexible way to modify such values than `map()`. Note that `data.replace()` is distinct from `data.str.replace()`, which performs element-wise string substitution.
```python
# Replace one value:
data.replace(-999, np.nan)
# Replace a list of values:
data.replace([-999, -1000], np.nan)
# Replace each value with a different substitute:
data.replace([-999, -1000], [np.nan, 0])
# Replace with a dictionary:
data.replace({-999: np.nan, -1000: 0})
```
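
The difference between `replace()` and `str.replace()` can be seen in a small runnable sketch. The Series `readings` and `codes` below are made-up sample data, not part of the original notes.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings where -999 and -1000 are sentinel values
readings = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

# Series.replace matches whole values
cleaned = readings.replace({-999: np.nan, -1000: 0})

# Series.str.replace substitutes text inside each string
codes = pd.Series(["a-1", "b-2", "c-3"])
underscored = codes.str.replace("-", "_", regex=False)

print(f"Cleaned readings: \n{cleaned}")
print(f"\nCodes with underscores: \n{underscored}")
```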

### Renaming Axis Indexes



In [5]:
import pandas as pd 
import numpy as np 

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=["Ohio", "Colorado", "New York"],
                    columns=["one", "two", "three", "four"])

# Function returning first 4 letters in uppercase 
def transform(x):
    return x[:4].upper()

print(f"DataFrame before changes: \n{data}")
# Modifying the DataFrame index directly
data.index = data.index.map(transform)

# rename() returns a new DataFrame: title-case index labels, uppercase column names
data2 = data.rename(index=str.title, columns=str.upper)

print(f"\nDataFrame after changes: \n{data2}")

# Renaming with dictionary
data3 = data.rename(index={"OHIO": "INDIANA"},
                    columns={"three": "peekaboo"})

print(f"\nDataFrame after renaming: \n{data3}")

DataFrame before changes: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

DataFrame after changes: 
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11

DataFrame after renaming: 
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11


### Discretization and Binning

Continuous data is often discretized or otherwise separated into 'bins' for analysis. `pandas.cut()` returns a special Categorical object in which each bin is identified by an interval value type holding the lower and upper limit of the bin.

The categories are displayed as `Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]`. A parenthesis means that the side is open (exclusive), while a square bracket means it is closed (inclusive). Which side is closed can be changed by passing `right=False`: `pd.cut(ages, bins, right=False)`.

If you pass `pandas.cut()` an integer number of bins instead of explicit bin edges, it computes equal-length bins based on the minimum and maximum values in the data. For decimal data you can control how many decimal digits are used for the bin edges, e.g. `pd.cut(data, 4, precision=2)` limits the precision to two digits.

`pandas.qcut()` bins the data based on sample quantiles. Depending on the distribution of the data, `pandas.cut()` will usually not produce bins with the same number of data points, whereas `pandas.qcut()` produces bins of roughly equal size.

In [8]:
import pandas as pd 
import numpy as np 

ages = [20, 22, 45, 61, 27, 34, 38, 39, 41, 43, 25, 29, 31, 55, 62]

# Dividing 'ages' into bins: 18-25, 26-35, 36-60, 61+
bins = [18, 25, 35, 60, 100]

age_cat = pd.cut(ages, bins)

print(f"{age_cat}")

# Exploring the new categorical variable 'age_cat'
print(f"\n{age_cat.codes}")
print(f"\nCategories in 'age_cat': \n{age_cat.categories}")
print(f"\nCategory '0' is: {age_cat.categories[0]}")
print(f"\nCounting values in each category: \n{age_cat.value_counts()}")


[(18, 25], (18, 25], (35, 60], (60, 100], (25, 35], ..., (18, 25], (25, 35], (25, 35], (35, 60], (60, 100]]
Length: 15
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

[0 0 2 3 1 1 2 2 2 2 0 1 1 2 3]

Categories in 'age_cat': 
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

Category '0' is: (18, 25]

Counting values in each category: 
(18, 25]     3
(25, 35]     4
(35, 60]     6
(60, 100]    2
Name: count, dtype: int64


In [9]:
import pandas as pd 
import numpy as np 

ages = [20, 22, 45, 61, 27, 34, 38, 39, 41, 43, 25, 29, 31, 55, 62]

# Same bin edges, but right=False makes the intervals left-closed: [18, 25), [25, 35), [35, 60), [60, 100)
bins = [18, 25, 35, 60, 100]
age_labels = ["youth", "youngAdult", "middleAged", "Senior"]

age_cat = pd.cut(ages, bins, right=False, labels=age_labels)

print(f"{age_cat}")
print(f"\nCategories in 'age_cat': \n{age_cat.categories}")

['youth', 'youth', 'middleAged', 'Senior', 'youngAdult', ..., 'youngAdult', 'youngAdult', 'youngAdult', 'middleAged', 'Senior']
Length: 15
Categories (4, object): ['youth' < 'youngAdult' < 'middleAged' < 'Senior']

Categories in 'age_cat': 
Index(['youth', 'youngAdult', 'middleAged', 'Senior'], dtype='object')


In [12]:
import pandas as pd 
import numpy as np 

data = np.random.standard_normal(1000)

data_cut = pd.cut(data, 4, precision=2)
print(f"Data cut: \n{data_cut.value_counts()}")

data_qcut = pd.qcut(data, 4, precision=2)
print(f"\nData q-cut: \n{data_qcut.value_counts()}")

print(f"\nq-cut with specific cut values: \n"
      f"{pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.]).value_counts()}"
      )

Data cut: 
(-3.16, -1.55]     56
(-1.55, 0.058]    478
(0.058, 1.66]     415
(1.66, 3.27]       51
Name: count, dtype: int64

Data q-cut: 
(-3.1599999999999997, -0.72]    250
(-0.72, -0.033]                 250
(-0.033, 0.63]                  250
(0.63, 3.27]                    250
Name: count, dtype: int64

q-cut with specific cut values: 
(-3.15, -1.248]      100
(-1.248, -0.0334]    400
(-0.0334, 1.269]     400
(1.269, 3.265]       100
Name: count, dtype: int64


### Detecting and Filtering Outliers



In [15]:
import pandas as pd 
import numpy as np 

data =  pd.DataFrame(np.random.standard_normal((2000, 4)))

print(f"Describing 'data' DataFrame: \n{data.describe()}")

# Finding values in column 2 whose absolute value exceeds 3
#col2 = data[2]
#print(f"\nValues > 3 in col 2: \n{col2[col2.abs() > 3]}")

# Selecting every row with at least one value above 3 or below -3
print(f"\n{data[(data.abs() > 3).any(axis='columns')]}")

# To cap values outside the interval -3 to 3:
data[data.abs() > 3 ] = np.sign(data) * 3
# np.sign(data) produces 1 and -1 values based on positive/negative value

print(f"\nDescribe data with cap: \n{data.describe()}")


Describing 'data' DataFrame: 
                 0            1            2            3
count  2000.000000  2000.000000  2000.000000  2000.000000
mean     -0.008834    -0.025125     0.021726    -0.011364
std       1.010555     1.002944     0.992115     1.019340
min      -4.569266    -4.436342    -3.466420    -3.636668
25%      -0.713590    -0.715966    -0.625877    -0.687284
50%       0.018050    -0.020463     0.040180    -0.027966
75%       0.662537     0.672921     0.670434     0.666354
max       3.811795     3.532332     3.943144     2.932167

             0         1         2         3
93    0.765138 -0.309372  3.012344  0.832574
100   3.195022  2.122922 -0.207912 -0.720975
118  -0.728464  0.716605  3.613930 -0.039634
296   1.911241 -0.513293  3.379291  0.179245
367   0.102828 -4.436342  0.572497 -1.093080
395   1.276417 -1.877794 -0.632602 -3.636668
428   0.561925 -3.561330  0.162459  0.530968
485  -3.169785 -0.231187  0.018173  0.280108
492   3.811795  0.473920 -0.812615 -0.5915

### Permutation and Random Sampling

We can permute (randomly reorder) a Series or the rows of a DataFrame using the `numpy.random.permutation()` function. Calling it with the length of the axis you want to permute produces an array of integers indicating the new ordering. That array (`sampler5` and `sampler7` in the example) can then be passed to `take()` or used with `iloc[]`.

To select a random subset without replacement, use the `sample()` method. Passing `replace=True` samples with replacement, which allows repeated choices.

In [21]:
import pandas as pd
import numpy as np 

data = pd.DataFrame(np.arange(5 * 7).reshape((5, 7)))

# Permutations of 5 and 7 integers, matching the number of rows and columns
sampler5 = np.random.permutation(5)
sampler7 = np.random.permutation(7)

print(f"\nPermutating DataFrame's columns with 'take()' '{sampler7}':"
      f"\n{data.take(sampler7, axis='columns')}"
      )

print(f"\nPermutating DataFrame's rows with 'iloc()' '{sampler5}':"
      f"\n{data.iloc[sampler5]}"
      )

print(f"\nRandom sample of 2 rows: \n{data.sample(n=2)}")

numbers = pd.Series([3, 5, 7 , -2, -9])
print(f"\nRandom sample of 7 number with repetition (replacement):"
      f"\n{numbers.sample(n=7, replace=True)}")


Permutating DataFrame's columns with 'take()' '[5 1 6 3 0 2 4]':
    5   1   6   3   0   2   4
0   5   1   6   3   0   2   4
1  12   8  13  10   7   9  11
2  19  15  20  17  14  16  18
3  26  22  27  24  21  23  25
4  33  29  34  31  28  30  32

Permutating DataFrame's rows with 'iloc()' '[4 1 0 3 2]':
    0   1   2   3   4   5   6
4  28  29  30  31  32  33  34
1   7   8   9  10  11  12  13
0   0   1   2   3   4   5   6
3  21  22  23  24  25  26  27
2  14  15  16  17  18  19  20

Random sample of 2 rows: 
    0   1   2   3   4   5   6
4  28  29  30  31  32  33  34
1   7   8   9  10  11  12  13

Random sample of 7 number with repetition (replacement):
4   -9
1    5
1    5
4   -9
2    7
3   -2
3   -2
dtype: int64


### Computing Indicator/Dummy Variables

Converting a categorical variable into a *dummy* or *indicator* matrix is useful for statistical modeling or machine learning. If a column has *k* distinct values, the `pandas.get_dummies()` function derives a matrix with *k* columns containing 1s and 0s (`True` and `False` in recent pandas versions), one column per category.

For larger data, it would be better to write a lower-level function that writes directly to a NumPy array, and then wrap the result in a DataFrame.
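
One way to read the lower-level approach is sketched below: allocate a zero-filled NumPy array, set the matching cells to 1, and wrap the result in a DataFrame. The three-element `genres` Series is made-up sample data in the same pipe-separated format as the movie genres used later, and the helper names (`all_genres`, `column_of`) are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical pipe-separated categories, one string per row
genres = pd.Series(["Drama|Thriller", "Comedy", "Action|Comedy|Drama"])

# Collect the distinct categories and give each one a column position
all_genres = sorted({g for entry in genres for g in entry.split("|")})
column_of = {g: i for i, g in enumerate(all_genres)}

# Start from zeros and write the 1s directly into the array
matrix = np.zeros((len(genres), len(all_genres)), dtype=np.uint8)
for row, entry in enumerate(genres):
    cols = [column_of[g] for g in entry.split("|")]
    matrix[row, cols] = 1

# Wrap the finished array in a DataFrame
indicators = pd.DataFrame(matrix, index=genres.index, columns=all_genres)
print(f"Indicator matrix: \n{indicators}")
```

Because the array is filled in place, no intermediate DataFrame is built per row; whether this is actually faster than `str.get_dummies()` on a given dataset would need to be measured.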

In [24]:
import pandas as pd 
import numpy as np 

df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "data1": range(6)})

print(f"DataFrame: \n{df}")

print(f"\nDataFrame dummies: \n{pd.get_dummies(df['key'])}")


DataFrame: 
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5

DataFrame dummies: 
       a      b      c
0  False   True  False
1  False   True  False
2   True  False  False
3  False  False   True
4   True  False  False
5  False   True  False


In [26]:
import pandas as pd 
import numpy as np 

df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "data1": range(6)})

# dtype=int returns 0/1 columns instead of the default True/False
dummies = pd.get_dummies(df["key"], prefix="key", dtype=int)
df_dummies = df[["data1"]].join(dummies)

print(f"DataFrame with dummies:\n{df_dummies}")

DataFrame with dummies:
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0



In [11]:
import pandas as pd 
import numpy as np 

mnames = ["movie_id", "title", "genres"]
movies = pd.read_table("datasets/movies.dat", sep="::", 
                       header=None, names=mnames, engine="python")

print(f"10 rows from 'movies.dat' table: \n{movies[15:25]}")

m_dummies = movies["genres"].str.get_dummies("|")
print(f"\nCategories in the first 10 movies: \n{m_dummies.iloc[:10, :6]}")

movies_windic = movies.join(m_dummies.add_prefix("Genre_"))
print(f"\n{movies_windic.iloc[0]}")

10 rows from 'movies.dat' table: 
    movie_id                                  title                genres
15        16                          Casino (1995)        Drama|Thriller
16        17           Sense and Sensibility (1995)         Drama|Romance
17        18                      Four Rooms (1995)              Thriller
18        19  Ace Ventura: When Nature Calls (1995)                Comedy
19        20                     Money Train (1995)                Action
20        21                      Get Shorty (1995)   Action|Comedy|Drama
21        22                         Copycat (1995)  Crime|Drama|Thriller
22        23                       Assassins (1995)              Thriller
23        24                          Powder (1995)          Drama|Sci-Fi
24        25               Leaving Las Vegas (1995)         Drama|Romance

Categories in the first 10 movies: 
   Action  Adventure  Animation  Children's  Comedy  Crime
0       0          0          1           1       1     

In [13]:
import pandas as pd 
import numpy as np 

np.random.seed(255198)

values = np.random.uniform(size=10)

print(f"Array of values: \n{values}")

# Combining pandas.get_dummies() with pandas.cut()
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
cat_values = pd.get_dummies(pd.cut(values, bins))

print(f"\nCategorized Values: \n{cat_values}")

Array of values: 
[0.85583096 0.38204763 0.4648361  0.57742467 0.94388731 0.22337882
 0.21534198 0.56622647 0.72537081 0.64196748]

Categorized Values: 
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0       False       False       False       False        True
1       False        True       False       False       False
2       False       False        True       False       False
3       False       False        True       False       False
4       False       False       False       False        True
5       False        True       False       False       False
6       False        True       False       False       False
7       False       False        True       False       False
8       False       False       False        True       False
9       False       False       False        True       False


## Extension Data Types

*Pandas extension data types*
|Extension type|Description|
|---|---|
|BooleanDtype |Nullable Boolean data, use "boolean" when passing as string|
|CategoricalDtype |Categorical data type, use "category" when passing as string|
|DatetimeTZDtype |Datetime with time zone|
|Float32Dtype |32-bit nullable floating point, use "Float32" when passing as string|
|Float64Dtype |64-bit nullable floating point, use "Float64" when passing as string|
|Int8Dtype| 8-bit nullable signed integer, use "Int8" when passing as string|
|Int16Dtype |16-bit nullable signed integer, use "Int16" when passing as string|
|Int32Dtype |32-bit nullable signed integer, use "Int32" when passing as string|
|Int64Dtype |64-bit nullable signed integer, use "Int64" when passing as string|
|UInt8Dtype |8-bit nullable unsigned integer, use "UInt8" when passing as string|
|UInt16Dtype |16-bit nullable unsigned integer, use "UInt16" when passing as string|
|UInt32Dtype |32-bit nullable unsigned integer, use "UInt32" when passing as string|
|UInt64Dtype |64-bit nullable unsigned integer, use "UInt64" when passing as string|

Extension types can be passed to the Series `astype()` method, which makes conversion easy during data cleaning: `df["A"] = df["A"].astype("Float64")`.

For large datasets, string arrays such as `pd.Series(['one', 'two', None, 'three'], dtype=pd.StringDtype())` generally use less memory than object arrays and are frequently more computationally efficient.

With a pandas extension data type such as `pd.Int64Dtype`, missing values are represented by `pd.NA` instead of `np.nan`: `pd.Series([1, 2, 3, None], dtype=pd.Int64Dtype())`.
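
The inline snippets above can be combined into one runnable sketch. The DataFrame `df` and its columns `A` and `B` are made up for illustration.

```python
import pandas as pd

# Nullable integer Series: the missing entry is displayed as <NA>
ints = pd.Series([1, 2, 3, None], dtype=pd.Int64Dtype())
print(f"Nullable integers: \n{ints}")
print(f"\nIs the last value pd.NA? {ints[3] is pd.NA}")

# Without an extension type, the same data falls back to float64 with NaN
floats = pd.Series([1, 2, 3, None])
print(f"\nDefault dtype: {floats.dtype}")

# astype() accepts the string aliases from the table above
df = pd.DataFrame({"A": [1, 2, None, 4],
                   "B": ["one", "two", "three", None]})
df["A"] = df["A"].astype("Int64")
df["B"] = df["B"].astype("string")
print(f"\nConverted dtypes: \n{df.dtypes}")
```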



## String Manipulation

### Python Built-In String Object Methods

*Python built-in string methods*
|Method|Description|
|---|---|
|count |Return the number of nonoverlapping occurrences of substring in the string|
|endswith |Return True if string ends with suffix|
|startswith |Return True if string starts with prefix|
|join |Use string as delimiter for concatenating a sequence of other strings|
|index |Return starting index of the first occurrence of passed substring if found in the string; otherwise, raises ValueError if not found|
|find |Return position of first character of first occurrence of substring in the string; like index, but returns -1 if not found|
|rfind |Return position of first character of last occurrence of substring in the string; returns -1 if not found|
|replace |Replace occurrences of string with another string|
|strip, rstrip, lstrip |Trim whitespace, including newlines on both sides, on the right side, or on the left side, respectively|
|split |Break string into list of substrings using passed delimiter|
|lower |Convert alphabet characters to lowercase|
|upper |Convert alphabet characters to uppercase|
|casefold |Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form|
|ljust, rjust |Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width|

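A few of the methods in the table, shown on a made-up sample string. The main point is the difference between `find`, which returns -1 when the substring is missing, and `index`, which raises `ValueError`.

```python
text = "a,b,guido"

# Substring detection with the 'in' keyword
print(f"'guido' in text: {'guido' in text}")

# find returns -1 when the substring is absent
print(f"find(':'): {text.find(':')}")

# index returns the position of the first occurrence
print(f"index(','): {text.index(',')}")

# count and replace
print(f"count(','): {text.count(',')}")
print(f"replace: {text.replace(',', '::')}")

# index raises ValueError when the substring is absent
try:
    text.index(":")
except ValueError as err:
    print(f"index(':') raised ValueError: {err}")
```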


In [14]:
import pandas as pd 
import numpy as np 

values = "a,b, c,   delta"

# combining split with strip to separate values and remove whitespaces 
pieces = [x.strip() for x in values.split(",")]
print(f"Values: {pieces}")

# Adding a double-colon delimiter with 'join'
pieces = "::".join(pieces)
print(f"Values with delimiter: {pieces}")



Values: ['a', 'b', 'c', 'delta']
Values with delimiter: a::b::c::delta


### Regular Expressions

|Method|Description|
|---|---|
|findall |Return all nonoverlapping matching patterns in a string as a list|
|finditer |Like findall, but returns an iterator|
|match |Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, return a match object, and otherwise None|
|search |Scan string for match to pattern, returning a match object if so; unlike match, the match can be anywhere in the string as opposed to only at the beginning|
|split |Break string into pieces at each occurrence of pattern|
|sub, subn |Replace occurrences of pattern in string with replacement expression; sub returns the new string, while subn returns a tuple of the new string and the number of substitutions made; use symbols \1, \2, ... to refer to match group elements in the replacement string|

Creating a 'regex' object with `re.compile` will save CPU cycles if you intend to apply the same expression to many strings.
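
The difference between `search` and `match` can be sketched with a compiled pattern. The sample `text` is made up; the email pattern is the same one used in the cell below. The sketch also shows that `subn` returns the substituted string together with the number of replacements.

```python
import re

# Compile once, reuse for several calls
regex = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}",
                   flags=re.IGNORECASE)

text = "Dave dave@google.com, Steve steve@gmail.com"

# search scans the whole string and returns the first match object
first = regex.search(text)
print(f"First match: {first.group()} at [{first.start()}, {first.end()})")

# match only succeeds when the pattern occurs at the start of the string
print(f"match at start: {regex.match(text)}")

# subn returns (new_string, number_of_substitutions)
new_text, n_subs = regex.subn("REDACTED", text)
print(f"{n_subs} substitutions: {new_text}")
```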



In [18]:
import re  # regular expressions

regex_space = re.compile(r"\s+")  # Regular expression matching one or more whitespace characters

text = "foo bar\t  baz    \tqux"

print(f"Printing 'text' clean:\n{regex_space.split(text)}")

text_mail = """
    Dave dave@google.com
    Steve steve@gmail.com
    Rob rob@gmail.com
    Ryan ryan@yahoo.com
"""

patt_mail = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
patt_mail_seg = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"


# Extracting emails with the patterns above,
# using re.IGNORECASE to make the regex case insensitive
regex_mail = re.compile(patt_mail, flags=re.IGNORECASE)

regex_mail_seg = re.compile(patt_mail_seg, flags=re.IGNORECASE)

print(f"\nFinding emails: \n{regex_mail.findall(text_mail)}")

print(f"\nFinding emails: \n{regex_mail_seg.findall(text_mail)}")

print(regex_mail_seg.sub(r"Username: \1, Domain: \2, Suffix: \3", text_mail))

Printing 'text' clean:
['foo', 'bar', 'baz', 'qux']

Finding emails: 
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

Finding emails: 
[('dave', 'google', 'com'), ('steve', 'gmail', 'com'), ('rob', 'gmail', 'com'), ('ryan', 'yahoo', 'com')]

    Dave Username: dave, Domain: google, Suffix: com
    Steve Username: steve, Domain: gmail, Suffix: com
    Rob Username: rob, Domain: gmail, Suffix: com
    Ryan Username: ryan, Domain: yahoo, Suffix: com



### String Functions in Pandas

*Partial listing of Series string methods*
|Method|Description|
|---|---|
|cat |Concatenate strings element-wise with optional delimiter|
|contains |Return Boolean array if each string contains pattern/regex|
|count |Count occurrences of pattern|
|extract |Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group|
|endswith |Equivalent to x.endswith(pattern) for each element|
|startswith |Equivalent to x.startswith(pattern) for each element|
|findall |Compute list of all occurrences of pattern/regex for each string|
|get| Index into each element (retrieve i-th element)|
|isalnum |Equivalent to built-in str.isalnum|
|isalpha| Equivalent to built-in str.isalpha|
|isdecimal| Equivalent to built-in str.isdecimal|
|isdigit| Equivalent to built-in str.isdigit|
|islower |Equivalent to built-in str.islower|
|isnumeric |Equivalent to built-in str.isnumeric|
|isupper |Equivalent to built-in str.isupper|
|join |Join strings in each element of the Series with passed separator|
|len |Compute length of each string|
|lower, upper |Convert cases; equivalent to x.lower() or x.upper() for each element|
|match |Use re.match with the passed regular expression on each element, returning True or False whether it matches|
|pad |Add whitespace to left, right, or both sides of strings|
|center |Equivalent to pad(side="both")|
|repeat| Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string)|
|replace |Replace occurrences of pattern/regex with some other string|
|slice |Slice each string in the Series|
|split |Split strings on delimiter or regular expression|
|strip |Trim whitespace from both sides, including newlines|
|rstrip |Trim whitespace on right side|
|lstrip |Trim whitespace on left side|

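String and regular expression functions can be applied to each value with `data.map`, but they fail on NA values. In that case we can use the Series string methods, available through the `str` accessor, which skip over and propagate NA values.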

In [20]:
import re

import pandas as pd 
import numpy as np 

data = {"Dave": "dave@google.com", "Steve": "steve@gmail.com",
        "Rob": "rob@gmail.com", "Wes": np.nan}

data = pd.Series(data)

print(f"Data: \n{data}")
print(f"\nNA values in data: \n{data.isna()}")

data_string = data.astype("string")

print(f"\nData as string Dtype: \n{data_string}")

patt_mail = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

match_mail = data.str.findall(patt_mail, flags=re.IGNORECASE).str[0]

print(f"\n{match_mail}")

Data: 
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

NA values in data: 
Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

Data as string Dtype: 
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                 <NA>
dtype: string

Dave     (dave, google, com)
Steve    (steve, gmail, com)
Rob        (rob, gmail, com)
Wes                      NaN
dtype: object
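
A few more of the methods from the table can be sketched on the same kind of data. This is a minimal illustration, not an exhaustive tour; the variable names `has_gmail`, `parts` and `prefix` are chosen here for the example. `na=False` is passed to `contains` so that the missing entry is reported as `False` regardless of the string Dtype in use.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({"Dave": "dave@google.com", "Steve": "steve@gmail.com",
                  "Rob": "rob@gmail.com", "Wes": np.nan})

patt_mail = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# 'contains' returns one Boolean per element; 'na=False' fills the missing entry
has_gmail = data.str.contains("gmail", na=False)
print(f"Contains 'gmail': \n{has_gmail}")

# 'extract' returns a DataFrame with one column per regex group
parts = data.str.extract(patt_mail, flags=re.IGNORECASE)
print(f"\nExtracted groups: \n{parts}")

# slicing each string through the 'str' accessor; the NA entry is propagated
prefix = data.str[:5]
print(f"\nFirst five characters: \n{prefix}")
```

Unlike `findall`, which returns a list of tuples per element, `extract` places each group in its own column, which is usually easier to work with afterwards.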


## Categorical Data

### Background and Motivation

*Categories* are the array of distinct values, and the integer values that reference the categories are the *category codes*.


In [24]:
import pandas as pd 
import numpy as np 

values = pd.Series(["apple", "orange", "apple", "apple"] * 2)

print(f"Series values: \n{values}")

print(f"\nUnique values in values: {pd.unique(values)}")

print(f"\nValue counts: \n{values.value_counts()}")



Series values: 
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

Unique values in values: ['apple' 'orange']

Value counts: 
apple     6
orange    2
Name: count, dtype: int64


In [23]:
import pandas as pd 
import numpy as np 

# categories and category codes
codes = pd.Series([0, 1, 0, 0] * 2)
cats = pd.Series(["apple", "orange"])

print(f"Category codes: \n{codes}")
print(f"\nCategories: \n{cats}")

# 'take()' method to replace codes
print(f"\nCategories instead of codes: \n{cats.take(codes)}")

Category codes: 
0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

Categories: 
0     apple
1    orange
dtype: object

Categories instead of codes: 
0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object


### Categorical Extension Type in pandas

A popular data compression technique is *encoding* (an integer-based categorical representation) for data with many occurrences of similar values. It can provide faster performance and lower memory use.

If we have encoded data, we can convert it into a categorical with `pandas.Categorical.from_codes()`, passing the *codes* and *categories* arguments as lists. Unless explicitly specified, categorical conversions assume no specific ordering. Using `from_codes()` or any other constructor, you can indicate that the categories have a meaningful ordering: `pandas.Categorical.from_codes(codes, categories, ordered=True)`.



In [29]:
import pandas as pd 
import numpy as np 

fruits = ["apple", "orange", "apple", "apple"] * 2

N = len(fruits)

rng = np.random.default_rng(seed=255198)

df = pd.DataFrame({
    "fruit": fruits,
    "basket_id": np.arange(N),
    "count": rng.integers(3, 5, size=N),
    "weight": rng.uniform(0, 4, size=N)
    },
    columns=["basket_id", "fruit", "count", "weight"]
    )

print(f"Fruits DataFrame: \n{df}")

# Converting 'fruit' column into categorical data type
fruit_cat = df["fruit"].astype("category")
# 'fruit_cat' is a Series; its '.array' attribute is a 'pandas.Categorical' 
fc = fruit_cat.array 
print(f"\nType of fruit categories: {fc}")
print(f"\nFruit categories: {fc.categories}")
print(f"Fruit codes: {fc.codes}")

## Mapping between codes and categories:
print(f"\nCodes and categories: {dict(enumerate(fc.categories))}")

Fruits DataFrame: 
   basket_id   fruit  count    weight
0          0   apple      3  1.019423
1          1  orange      4  3.662935
2          2   apple      3  2.714591
3          3   apple      3  3.616318
4          4   apple      3  3.535980
5          5  orange      3  2.854480
6          6   apple      4  0.004824
7          7   apple      4  1.031763

Type of fruit categories: ['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
Categories (2, object): ['apple', 'orange']

Fruit categories: Index(['apple', 'orange'], dtype='object')
Fruit codes: [0 1 0 0 0 1 0 0]

Codes and categories: {0: 'apple', 1: 'orange'}


In [31]:
import pandas as pd 
import numpy as np 

## Creating directly a categorical object
my_cats = pd.Categorical(["miu", "mia", "fiu", "mia", "miu"])

print(f"My categories: \n{my_cats}")

## Alternative constructor:
categories = ["foo", "bar", "baz"]
codes = [0, 1, 2, 0, 0, 1]

my_cats_2 = pd.Categorical.from_codes(codes, categories)
print(f"\nMy alternative categories: {my_cats_2}")
print(f"\nOrdering my alt-cat: \n{my_cats_2.as_ordered()}")

My categories: 
['miu', 'mia', 'fiu', 'mia', 'miu']
Categories (3, object): ['fiu', 'mia', 'miu']

My alternative categories: ['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo', 'bar', 'baz']

Ordering my alt-cat: 
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
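
The `ordered=True` argument of `from_codes()` can be sketched with the same codes and categories. This is a small illustration under the assumption that the category order `foo < bar < baz` is meaningful; the name `ordered_cats` is introduced only for this example.

```python
import pandas as pd

categories = ["foo", "bar", "baz"]
codes = [0, 1, 2, 0, 0, 1]

# 'ordered=True' makes the category order meaningful: foo < bar < baz
ordered_cats = pd.Categorical.from_codes(codes, categories, ordered=True)
print(f"Ordered categorical: \n{ordered_cats}")

# ordered categoricals support min/max and comparisons against a category
print(f"\nSmallest: {ordered_cats.min()}, largest: {ordered_cats.max()}")
print(f"Greater than 'foo': {ordered_cats > 'foo'}")
```

The result is equivalent to calling `as_ordered()` on the unordered categorical; passing the flag to the constructor just avoids the extra step.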


### Computations with Categoricals

Some parts of pandas, such as the *groupby* function, perform better when working with categoricals. `pandas.qcut()` returns a `pandas.Categorical`.


In [33]:
import pandas as pd 
import numpy as np 

rng = np.random.default_rng(seed=255186)
draws = rng.standard_normal(1000)

# Generating 'bins' with 'qcut' and labeling 
bins = pd.qcut(draws, 4, labels=["Q1", "Q2", "Q3", "Q4"])

print(f"Data in bins: \n{bins}")
print(f"\nCodes: {bins.codes[:15]}")

bins = pd.Series(bins, name="quartile")
results = (pd.Series(draws)
           .groupby(bins)
           .agg(["count", "min", "max"])
           .reset_index())

print(f"\nResults: \n{results}")

Data in bins: 
['Q1', 'Q1', 'Q4', 'Q2', 'Q2', ..., 'Q2', 'Q3', 'Q4', 'Q3', 'Q1']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

Codes: [0 0 3 1 1 0 1 0 3 0 2 2 2 0 2]

Results: 
  quartile  count       min       max
0       Q1    250 -2.916361 -0.576712
1       Q2    250 -0.575496  0.017787
2       Q3    250  0.023344  0.724804
3       Q4    250  0.726007  2.804314



#### Better performance with categoricals

If a large number of entries are labeled with a small set of categories (strings), we can convert the labels (e.g. a pandas.Series) to the 'category' Dtype with the `.astype()` method. String labels use significantly more memory than the 'category' Dtype. GroupBy operations can also be faster, because the underlying algorithms work on the integer-based codes array instead of an array of strings.
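
The memory difference can be checked with `memory_usage(deep=True)`. The sketch below assumes an arbitrary size `N` and four made-up labels; the exact byte counts depend on the pandas version and the string Dtype in use, so no output is shown.

```python
import pandas as pd

N = 100_000  # arbitrary size chosen for this example

# string labels repeated many times
labels = pd.Series(["foo", "bar", "baz", "qux"] * (N // 4))

# same data stored as integer codes plus a small array of categories
categories = labels.astype("category")

label_bytes = labels.memory_usage(deep=True)
category_bytes = categories.memory_usage(deep=True)

print(f"Memory used by string labels: {label_bytes} bytes")
print(f"Memory used by categories:    {category_bytes} bytes")
```

The conversion itself has a one-time cost, so it pays off mainly when the column is reused across several operations.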

### Categorical Methods

*Categorical methods for Series in pandas*
|Method|Description|
|---|---|
|add_categories |Append new (unused) categories at end of existing categories|
|as_ordered |Make categories ordered|
|as_unordered |Make categories unordered|
|remove_categories |Remove categories, setting any removed values to null|
|remove_unused_categories |Remove any category values that do not appear in the data|
|rename_categories |Replace categories with indicated set of new category names; cannot change the number of categories|
|reorder_categories |Behaves like rename_categories, but can also change the result to have ordered categories|
|set_categories |Replace the categories with the indicated set of new categories; can add or remove categories|

For a pandas Series with a categorical Dtype, the special *accessor* attribute `cat` provides access to the categorical methods.

Categoricals are a convenient tool for saving memory and improving performance, but in large datasets many categories may no longer appear after filtering a DataFrame or Series. We can use the `remove_unused_categories` method to trim unobserved categories.

**Creating dummy variables for modeling**: the `pandas.get_dummies` function converts one-dimensional categorical data into a DataFrame of dummy variables (1s for occurrences of a given category and 0s otherwise).

In [40]:
import pandas as pd 
import numpy as np 

ser = pd.Series(["a", "b", "c", "d"] * 2)
cat_ser = ser.astype("category")

print(f"Series codes:\n{cat_ser.cat.codes}")

act_cat = ["a", "b", "c", "d", "e"]

# Series with new 'actual' categories
cat_ser2 = cat_ser.cat.set_categories(act_cat)
print(f"\nActual categories: \n{cat_ser2}")
print(f"\nCount actual categories: \n{cat_ser2.value_counts()}")

cat_ser3 = cat_ser[cat_ser.isin(['a', 'b'])]
print(f"\nCategories from series 3: {cat_ser3.cat.categories}")
print(f"\nRemoving unused categories: \n{cat_ser3.cat.remove_unused_categories()}")


Series codes:
0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

Actual categories: 
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

Count actual categories: 
a    2
b    2
c    2
d    2
e    0
Name: count, dtype: int64

Categories from series 3: Index(['a', 'b', 'c', 'd'], dtype='object')

Removing unused categories: 
0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): ['a', 'b']
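
The `pandas.get_dummies` function mentioned above can be sketched on the same kind of Series. `dtype=int` is passed here so the result holds 1s and 0s; without it, recent pandas versions return Boolean columns. The name `dummies` is introduced only for this example.

```python
import pandas as pd

cat_ser = pd.Series(["a", "b", "c", "d"] * 2, dtype="category")

# one column per category; 1 where the row belongs to that category
dummies = pd.get_dummies(cat_ser, dtype=int)

print(f"Dummy variables: \n{dummies}")
```

Each row contains exactly one 1, in the column of the category that row holds.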
