# CHAPTER 7
# Data Cleaning and Preparation
- Data preparation (loading, cleaning, transforming, and rearranging) is often reported to take up 80% or more of an analyst's time.
- **pandas**, along with built-in Python features, provides a high-level, flexible, and fast set of tools to manipulate data into the right form.

## Handling Missing Data
- Missing data occurs in many data analysis applications.
- **pandas** tries to make working with missing data as painless as possible.
- For example, all of the descriptive statistics on **pandas objects** exclude missing data by default.
- For numeric data, pandas uses the floating-point value **NaN (Not a Number)** to represent missing data - this is called a *sentinel value*.
- For other data types pandas uses **NA (not available)** to represent missing values. 
- In statistics applications, **NA data** may either be data that does not exist or that exists but was not observed (through problems with data collection, for example). 
- When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.
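As a small illustration of the "exclude missing data by default" behaviour, the sketch below computes a mean with and without skipping NaN. The `readings` Series is a made-up example, not data from this chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; one observation is missing
readings = pd.Series([1.0, np.nan, 3.0])

# Default behaviour: the NaN is skipped, so the mean is (1 + 3) / 2
mean_skipping_na = readings.mean()

# With skipna=False the missing value propagates and the result is NaN
mean_with_na = readings.mean(skipna=False)

print(mean_skipping_na)  # 2.0
print(mean_with_na)      # nan
```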

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
# Create a pandas Series of strings containing one missing value
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [3]:
# Use isnull function to check for missing values
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [4]:
# The built-in Python None value is also treated as NA in object arrays

# Replace the first element in string_data with None
string_data[0] = None

# Check for missing values
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

**TABLE**: NA handling methods

| Method                    | Description |
| :---                  |    :----    |
|dropna| Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
|fillna| Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
|isnull| Return boolean values indicating which values are missing/NA.
|notnull| Negation of isnull.

### Filtering Out Missing Data
- You can filter out missing data by hand using **pandas.isnull** and boolean indexing.
- Or you can use the **dropna** function. 
- On a Series, **dropna** returns the Series with only the non-null data and index values.
- With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA or only those containing any NAs. **dropna** by default drops any row containing a missing value.
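The "by hand" approach and **dropna** can be compared directly. The sketch below uses a made-up Series named `values`; both routes are expected to give the same result.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Filter by hand: keep only the entries where notnull() is True
by_hand = values[values.notnull()]

# dropna() is the shorthand for the same operation
dropped = values.dropna()

print(by_hand.equals(dropped))  # True
```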

In [5]:
# Create a Series that contains missing values
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [6]:
# Use dropna to remove the missing values
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [7]:
# Create a DataFrame with missing values
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [8]:
# Use dropna to remove row with missing values
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [9]:
# Passing how='all' will only drop rows that are all NA
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [10]:
# Create a 4th column with all values NA
data[4] = np.nan
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [11]:
# To drop columns in the same way, pass axis=1
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


- Suppose you want to keep only rows containing a certain number of observations. 
- You can indicate this with the **thresh** argument for **dropna**.

In [12]:
# Create a DataFrame with 7 rows and 3 columns
df = pd.DataFrame(np.random.randn(7, 3))

# Replace first 4 values for column 1 with NA values
df.iloc[:4, 1] = np.nan

# Replace the first 2 values for column 2 with NA values
df.iloc[:2, 2] = np.nan

df

Unnamed: 0,0,1,2
0,0.597607,,
1,0.573669,,
2,0.439961,,-0.031793
3,1.411603,,-0.977287
4,-1.359108,-1.133783,-0.602429
5,1.323524,-1.236592,-0.365065
6,0.00312,1.013268,-0.582838


In [13]:
# If we use dropna with no arguments, all rows with at least 1 NA value will be filtered out
df.dropna()

Unnamed: 0,0,1,2
4,-1.359108,-1.133783,-0.602429
5,1.323524,-1.236592,-0.365065
6,0.00312,1.013268,-0.582838


In [14]:
# The thresh argument sets the minimum number of non-NA values a row must have to be kept
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.439961,,-0.031793
3,1.411603,,-0.977287
4,-1.359108,-1.133783,-0.602429
5,1.323524,-1.236592,-0.365065
6,0.00312,1.013268,-0.582838
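Because the random values above differ on every run, a deterministic sketch may make the meaning of **thresh** easier to check. The `obs` frame is a hypothetical example; **thresh** counts the observed (non-NA) values in each row.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 has 1 observed value, row 1 has 2, row 2 has 3
obs = pd.DataFrame([[1.0, np.nan, np.nan],
                    [2.0, np.nan, 5.0],
                    [3.0, 4.0, 6.0]])

# Keep only rows with at least 2 non-NA values
at_least_two = obs.dropna(thresh=2)
print(list(at_least_two.index))  # [1, 2]
```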


### Filling In Missing Data
- Rather than filtering out missing data you may want to fill in the “holes” in any number of ways. 
- For most purposes, the **fillna** method is the workhorse function to use.

In [15]:
# Calling fillna with a constant replaces missing values with that value
df.fillna(0)

Unnamed: 0,0,1,2
0,0.597607,0.0,0.0
1,0.573669,0.0,0.0
2,0.439961,0.0,-0.031793
3,1.411603,0.0,-0.977287
4,-1.359108,-1.133783,-0.602429
5,1.323524,-1.236592,-0.365065
6,0.00312,1.013268,-0.582838


In [16]:
# When calling fillna with a dict, you can use a different fill value for each column
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,0.597607,0.5,0.0
1,0.573669,0.5,0.0
2,0.439961,0.5,-0.031793
3,1.411603,0.5,-0.977287
4,-1.359108,-1.133783,-0.602429
5,1.323524,-1.236592,-0.365065
6,0.00312,1.013268,-0.582838


In [17]:
# fillna returns a new object, but you can modify the existing object in-place
df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,0.597607,0.0,0.0
1,0.573669,0.0,0.0
2,0.439961,0.0,-0.031793
3,1.411603,0.0,-0.977287
4,-1.359108,-1.133783,-0.602429
5,1.323524,-1.236592,-0.365065
6,0.00312,1.013268,-0.582838


In [18]:
# Create a new DataFrame
df2 = pd.DataFrame(np.random.randn(6, 3))

# Insert some missing values
df2.iloc[2:, 1] = np.nan
df2.iloc[4:, 2] = np.nan

df2

Unnamed: 0,0,1,2
0,-0.184676,1.208067,0.344193
1,1.743529,-0.482224,-1.34897
2,0.105603,,0.284715
3,1.723175,,0.388434
4,0.310821,,
5,-0.344407,,


In [19]:
# The same interpolation methods available for reindexing can be used with fillna
df2.fillna(method='ffill')

# 'ffill' = forward fill method

Unnamed: 0,0,1,2
0,-0.184676,1.208067,0.344193
1,1.743529,-0.482224,-1.34897
2,0.105603,-0.482224,0.284715
3,1.723175,-0.482224,0.388434
4,0.310821,-0.482224,0.388434
5,-0.344407,-0.482224,0.388434


In [20]:
# Create a new Series with missing values
data = pd.Series([1., np.nan, 3.5, np.nan, 7])

# With fillna you can also fill with a computed statistic, such as the mean or median of a Series
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

**TABLE**: *fillna* function arguments

| Argument                  | Description |
| :---                  |    :----    |
|value| Scalar value or dict-like object to use to fill missing values
|method| Interpolation method, such as 'ffill' (forward fill) or 'bfill' (backward fill); the default is None
|axis| Axis to fill on; default axis=0
|inplace| Modify the calling object without producing a copy
|limit| For forward and backward filling, maximum number of consecutive periods to fill
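The **limit** argument from the table can be sketched as follows, using a made-up Series named `gaps`. The example calls the `ffill` method directly, because recent pandas releases emit a deprecation warning for `fillna(method=...)`; on older versions `gaps.fillna(method='ffill', limit=2)` should behave the same way.

```python
import numpy as np
import pandas as pd

gaps = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill, but carry each value forward over at most 2 consecutive gaps
filled = gaps.ffill(limit=2)
print(filled.tolist())  # [1.0, 1.0, 1.0, nan, 5.0]
```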

## Data Transformation
### Removing Duplicates

In [21]:
# Create a DataFrame containing duplicates
df3 = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                    'k2': [1, 1, 2, 3, 3, 4, 4]})
df3

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [22]:
# The DataFrame method duplicated returns a boolean Series indicating whether each
# row is a duplicate (has been observed in a previous row) or not

df3.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [23]:
# drop_duplicates returns a DataFrame where the duplicated array is False
df3.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [24]:
# Filter duplicates only based on the 'k1' column
df3.drop_duplicates(['k1'])

Unnamed: 0,k1,k2
0,one,1
1,two,1


In [25]:
# duplicated and drop_duplicates by default keep the first observed value combination
# Passing keep='last' will return the last one

df3.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
6,two,4
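Besides `'first'` and `'last'`, the **keep** argument also accepts `False`, which flags every member of a duplicated group. A short sketch, rebuilding the same data under the hypothetical name `pairs`:

```python
import pandas as pd

pairs = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                      'k2': [1, 1, 2, 3, 3, 4, 4]})

# keep=False marks all rows that have a duplicate, not only the repeats
all_dupes = pairs[pairs.duplicated(keep=False)]
print(list(all_dupes.index))  # [5, 6]
```

This can be handy when you want to inspect the duplicated rows before deciding which copy to keep.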


### Transforming Data Using a Function or Mapping
- For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame.

In [26]:
# Create a DataFrame with data about various kinds of meats
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [27]:
# Suppose you wanted to add a column indicating the type of animal that each food came from

# Create a dict to map each meat to the animal
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}

# The map method on a Series accepts a function or dict-like object containing a mapping

In [28]:
# First we need to convert each value from 'food' column to lowercase using the str.lower
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [29]:
# Now we can use the Series map method to create an extra column called 'animal'
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [30]:
# Doing the same thing using a lambda function
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
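The two approaches differ when a value has no entry in the mapping: passing the dict to **map** yields NaN, while indexing the dict inside a lambda raises a `KeyError`. The sketch below uses a deliberately incomplete, hypothetical mapping named `lookup` to show this, with `dict.get` as one possible safeguard.

```python
import pandas as pd

lookup = {'bacon': 'pig', 'pastrami': 'cow'}
foods = pd.Series(['Bacon', 'pastrami', 'tofu'])  # 'tofu' is deliberately unmapped

# Passing the dict directly: unmapped values become NaN
mapped = foods.str.lower().map(lookup)

# Inside a lambda, dict.get avoids the KeyError that lookup[x.lower()] would raise
mapped_safe = foods.map(lambda x: lookup.get(x.lower(), 'unknown'))

print(mapped_safe.tolist())  # ['pig', 'cow', 'unknown']
```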

### Replacing Values
- Filling in missing data with the **fillna** method is a special case of more general value replacement. 
- **map** can be used to modify a subset of values in an object but **replace** provides a simpler and more flexible way to do so.

In [31]:
# Create a Series
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [32]:
# We can use replace to modify certain values
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [33]:
# replace multiple values at once
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [34]:
# To use a different replacement for each value, pass a list of substitutes
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
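The value-to-replacement pairs can also be passed as a dict, which keeps each sentinel next to its substitute. A minimal sketch with the same values under the hypothetical name `raw`:

```python
import numpy as np
import pandas as pd

raw = pd.Series([1., -999., 2., -999., -1000., 3.])

# Keys are the values to look for, dict values are their replacements
cleaned = raw.replace({-999: np.nan, -1000: 0})
print(cleaned.tolist())  # [1.0, nan, 2.0, nan, 0.0, 3.0]
```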

### Renaming Axis Indexes
- Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. 
- You can also modify the axes in-place without creating a new data structure.

In [35]:
# Create a DataFrame
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [36]:
# Like a Series, the axis indexes have a map method
data.index = data.index.map(lambda x: x[:4].upper())
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [37]:
# If you want to create a transformed version of a dataset without modifying the original, 
# a useful method is rename
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11
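**rename** also accepts dict-like objects, which lets you relabel only a subset of the axis labels. The sketch below rebuilds a similar frame under the hypothetical name `frame`; the new labels are arbitrary.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((3, 4)),
                     index=['OHIO', 'COLO', 'NEW'],
                     columns=['one', 'two', 'three', 'four'])

# Only the labels named in the dicts change; the original frame is left untouched
renamed = frame.rename(index={'OHIO': 'INDIANA'},
                       columns={'three': 'peekaboo'})
print(list(renamed.index))    # ['INDIANA', 'COLO', 'NEW']
print(list(renamed.columns))  # ['one', 'two', 'peekaboo', 'four']
```

Passing `inplace=True` would modify `frame` itself instead of returning a new object.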


### Discretization and Binning
- Continuous data is often discretized or otherwise separated into “bins” for analysis.

In [38]:
# Suppose you have data about a group of people and you want to group
# them into discrete age buckets

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [39]:
# Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older
bins = [18, 25, 35, 60, 100]

In [40]:
# To create the actual bins for the data we can use the pandas.cut function
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

**pandas.cut**:
- The object pandas returns is a special **Categorical object**. 
- The output describes the bins computed by **pandas.cut** 
- It contains a categories array specifying the distinct **category names** along with a labeling for the ages data in the **codes attribute**.
- **pd.value_counts(cats)** gives the bin counts for the result of **pandas.cut**.
- You can change which side is closed by passing **right=False**.
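A sketch of the **right=False** option, using the same ages but hypothetical bin edges shifted by one so that the left-closed bins cover the same age groups:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# Left-closed bins: [18, 26), [26, 36), [36, 61), [61, 100)
left_closed = pd.cut(ages, [18, 26, 36, 61, 100], right=False)

print(left_closed.categories.closed)  # left
print(left_closed.codes.tolist())
```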

In [41]:
# Check the codes
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [42]:
# Check the categories
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [43]:
# Check the bin count
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

In [44]:
# You can also pass your own bin names by passing a list or array to the labels option
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

In [45]:
# If you pass an integer number of bins to cut instead of explicit bin edges, it will compute 
# equal-length bins based on the minimum and maximum values in the data

data = np.random.randint(20, size=20)
data

array([15,  1, 19, 13,  5, 13, 13, 17,  1,  6,  3, 17,  7,  2, 12, 11,  9,
       18,  7, 14])

In [46]:
# Create 4 bins of equal-length
cats = pd.cut(data, 4, precision=0)
cats

[(14.0, 19.0], (1.0, 6.0], (14.0, 19.0], (10.0, 14.0], (1.0, 6.0], ..., (10.0, 14.0], (6.0, 10.0], (14.0, 19.0], (6.0, 10.0], (10.0, 14.0]]
Length: 20
Categories (4, interval[float64]): [(1.0, 6.0] < (6.0, 10.0] < (10.0, 14.0] < (14.0, 19.0]]

In [47]:
# Count the number of values in each bin
pd.value_counts(cats)

(10.0, 14.0]    6
(14.0, 19.0]    5
(1.0, 6.0]      5
(6.0, 10.0]     4
dtype: int64

- A closely related function, **qcut**, bins the data based on sample **quantiles**. 
- Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points. 
- Since qcut uses sample quantiles instead, by definition you will obtain roughly equal-size bins:

In [48]:
# Create a sample of normally distributed numbers
data = np.random.randn(1000)

# Cut into quartiles
cats = pd.qcut(data, 4) 
cats

[(-0.083, 0.64], (-0.752, -0.083], (-2.792, -0.752], (0.64, 3.039], (-0.083, 0.64], ..., (-0.083, 0.64], (0.64, 3.039], (-0.083, 0.64], (0.64, 3.039], (-0.752, -0.083]]
Length: 1000
Categories (4, interval[float64]): [(-2.792, -0.752] < (-0.752, -0.083] < (-0.083, 0.64] < (0.64, 3.039]]

In [49]:
# Count the number of values in each bin
pd.value_counts(cats)

(0.64, 3.039]       250
(-0.083, 0.64]      250
(-0.752, -0.083]    250
(-2.792, -0.752]    250
dtype: int64
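Like **cut**, **qcut** accepts explicit cut points, in this case quantiles between 0 and 1 inclusive. The sketch below uses a seeded generator (the seed is arbitrary) and hypothetical names; with 1,000 distinct values the bins should hold roughly 10%, 40%, 40%, and 10% of the data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
sample = rng.standard_normal(1000)

# Custom quantiles instead of a number of equal-sized bins
tails_and_middle = pd.qcut(sample, [0, 0.1, 0.5, 0.9, 1.0])

# Count the observations per bin, in bin order
counts = pd.Series(tails_and_middle).value_counts(sort=False)
print(counts)
```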

### Detecting and Filtering Outliers
- Filtering or transforming outliers is largely a matter of applying array operations.

In [50]:
# Consider a DataFrame with some normally distributed data
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.022645,0.071479,-0.02573,0.029736
std,1.013827,0.991613,0.995021,0.974804
min,-3.442751,-2.822571,-2.710805,-4.19773
25%,-0.607977,-0.643821,-0.697638,-0.579908
50%,0.036396,0.070334,-0.026486,0.076245
75%,0.645012,0.740741,0.620889,0.686545
max,3.368127,3.423619,3.486825,2.663907


In [51]:
# Find values in one of the columns exceeding 3 in absolute value
col = data[2]
col[np.abs(col) > 3]

177    3.486825
234    3.133377
486    3.245176
Name: 2, dtype: float64

In [52]:
# To select all rows having a value exceeding 3 or –3, you can use the any method on a
# boolean DataFrame

data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
86,0.292316,3.423619,-1.339635,0.349397
117,3.215051,0.474165,0.642041,0.590939
177,0.046093,-1.532825,3.486825,1.638403
226,-3.10398,-1.859655,0.270896,0.371081
234,-1.535043,0.362419,3.133377,2.056157
245,-0.598176,3.324558,1.176636,0.768478
422,1.319152,-1.221692,-0.213539,-3.186004
428,3.082042,0.219457,-0.803165,-0.606293
486,0.477619,-0.828317,3.245176,0.307169
535,3.237854,-0.417267,0.060325,1.256245


### Permutation and Random Sampling
- Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the **numpy.random.permutation** function. 
- Calling **permutation** with the length of the axis you want to permute produces an array of integers indicating the new ordering.

In [53]:
# Create a DataFrame
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [54]:
# Use the permutation function to create a sampler array
sampler = np.random.permutation(5)
sampler

array([2, 0, 1, 3, 4])

In [55]:
# Pass the sampler array to the take method, which returns the rows at the given
# positional indices along an axis
df.take(sampler)

Unnamed: 0,0,1,2,3
2,8,9,10,11
0,0,1,2,3
1,4,5,6,7
3,12,13,14,15
4,16,17,18,19


In [56]:
# To select a random subset without replacement, you can use the sample method
df.sample(n=3)

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
4,16,17,18,19
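To draw a sample with replacement, so that the same row or value can be chosen more than once, pass `replace=True` to **sample**. A minimal sketch with a made-up Series; `random_state` is set only to make the draw repeatable.

```python
import pandas as pd

choices = pd.Series([5, 7, -1, 6, 4])

# n can exceed len(choices) because values may repeat
draws = choices.sample(n=10, replace=True, random_state=0)
print(len(draws))  # 10
```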


### Computing Indicator/Dummy Variables
- Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a **dummy** or **indicator matrix**. 
- If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing all 1s and 0s. 
- pandas has a **get_dummies** function for doing this, though devising one yourself is not difficult.

In [57]:
# Create a DataFrame
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [58]:
# Use the get_dummies function on the 'key' column to derive a matrix with 3 columns
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [59]:
# get_dummies has a prefix argument if you want to add a prefix to the columns in the 
# indicator DataFrame

# Create the dummy matrix with prefix
dummies = pd.get_dummies(df['key'], prefix='key')

# Join the dummy matrix with the original data
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0
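A useful recipe for statistical applications is to combine **get_dummies** with a discretization function such as **cut**. The values and bin edges below are made up for illustration. `dtype=int` is passed because recent pandas versions return boolean columns by default, whereas the outputs in this chapter show 0/1 integers.

```python
import numpy as np
import pandas as pd

values = np.array([0.05, 0.25, 0.45, 0.65, 0.85, 0.30])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

# One indicator column per bin; each row has a single 1
indicators = pd.get_dummies(pd.cut(values, bins), dtype=int)
print(indicators.sum(axis=0).tolist())  # [1, 2, 1, 1, 1]
```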


**EXAMPLE**: MovieLens 1M dataset

In [61]:
# Define name for the columns
mnames = ['movie_id', 'title', 'genres']

# Read the data from movies.dat file
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames, engine='python')

movies.head(10)



Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [62]:
# Since a row in the DataFrame can belong to multiple genres,
# some wrangling is necessary to get the dummy matrix

# First, we extract the list of unique genres in the dataset
all_genres = []

for x in movies.genres:
    all_genres.extend(x.split('|'))

genres = pd.unique(all_genres)

genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

In [63]:
# We start with a NumPy array of all zeros
zero_matrix = np.zeros((len(movies), len(genres)))
zero_matrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [65]:
# Build a DataFrame from the zero matrix, using the unique genres as column names
dummies = pd.DataFrame(zero_matrix, columns=genres)
dummies.head()

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [66]:
# Select first element as an example
gen = movies.genres[0]

# Split the text based on '|'
gen.split('|')

['Animation', "Children's", 'Comedy']

In [67]:
# Use the dummies.columns to compute the column indices for each genre
dummies.columns.get_indexer(gen.split('|'))

array([0, 1, 2], dtype=int64)

In [69]:
# We can use .iloc to set values based on these indices:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

In [71]:
# Combine this with movies
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[5]

movie_id                                 6
title                          Heat (1995)
genres               Action|Crime|Thriller
Genre_Animation                          0
Genre_Children's                         0
Genre_Comedy                             0
Genre_Adventure                          0
Genre_Fantasy                            0
Genre_Romance                            0
Genre_Drama                              0
Genre_Action                             1
Genre_Crime                              1
Genre_Thriller                           1
Genre_Horror                             0
Genre_Sci-Fi                             0
Genre_Documentary                        0
Genre_War                                0
Genre_Musical                            0
Genre_Mystery                            0
Genre_Film-Noir                          0
Genre_Western                            0
Name: 5, dtype: object
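For delimiter-separated memberships like these, the string accessor's `get_dummies` method offers a shorter route than the explicit loop. The sketch below uses a two-row stand-in for the movies table (the `sample_movies` frame is hypothetical) rather than the full file. Note that it orders the columns alphabetically, unlike the loop above, which keeps first-seen order.

```python
import pandas as pd

# Two-row stand-in for the movies table
sample_movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Heat (1995)'],
    'genres': ["Animation|Children's|Comedy", 'Action|Crime|Thriller'],
})

# Split each string on '|' and build one 0/1 column per distinct genre
genre_dummies = sample_movies['genres'].str.get_dummies(sep='|')

with_indicators = sample_movies.join(genre_dummies.add_prefix('Genre_'))
print(with_indicators.iloc[1])
```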

## String Manipulation
### String Object Methods
- In many string munging and scripting applications, built-in string methods are sufficient.

In [72]:
# As an example, a comma-separated string can be broken into pieces with split
val = 'a,b, guido'
val.split(',')

['a', 'b', ' guido']

In [73]:
# split is often combined with strip to trim whitespace (including line breaks)
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

In [74]:
# These substrings could be concatenated together with a two-colon delimiter using addition
first, second, third = pieces
first + '::' + second + '::' + third

'a::b::guido'

In [75]:
# But this isn’t a practical generic method. A faster and more Pythonic way is to pass a
# list or tuple to the join method on the string '::'
'::'.join(pieces)

'a::b::guido'

In [76]:
# Using Python’s in keyword is the best way to detect a substring
'guido' in val

True

In [80]:
# Get the index of a certain substring
val.index(',')

# index raises an exception if the substring isn’t found

1

In [81]:
# You can also use find to get the index of a substring
val.find('a')

# find returns -1 if a substring isn't found

0

In [82]:
# count returns the number of occurrences of a particular substring
val.count(',')

2

In [83]:
# replace will substitute occurrences of one pattern for another
# It is commonly used to delete patterns, too, by passing an empty string
val.replace(',', '')

'ab guido'

**TABLE**: Python built-in string methods

| Method                    | Description |
| :---                  |    :----    |
|count| Return the number of non-overlapping occurrences of substring in the string.
|endswith| Returns True if string ends with suffix.
|startswith| Returns True if string starts with prefix.
|join| Use string as delimiter for concatenating a sequence of other strings.
|index |Return position of first character in substring if found in the string; raises ValueError if not found.
|find| Return position of first character of first occurrence of substring in the string; like index, but returns –1 if not found.
|rfind| Return position of first character of last occurrence of substring in the string; returns –1 if not found.
|replace| Replace occurrences of string with another string.
|strip, rstrip, lstrip| Trim whitespace, including newlines, from both ends (strip), the right end (rstrip), or the left end (lstrip).
|split| Break string into list of substrings using passed delimiter.
|lower| Convert alphabet characters to lowercase.
|upper| Convert alphabet characters to uppercase.
|casefold| Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form.
|ljust, rjust| Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width.
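A few of the methods in the table that are not demonstrated above can be sketched as follows. The input strings other than `val` are made up for the example.

```python
val = 'a,b, guido'

starts_with_a = val.startswith('a')      # True
ends_with_guido = val.endswith('guido')  # True
last_comma = val.rfind(',')              # 3, the position of the last comma
folded = 'Straße'.casefold()             # 'strasse'
padded = 'ab'.ljust(5, '.')              # 'ab...'

print(starts_with_a, ends_with_guido, last_comma, folded, padded)
```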

### Regular Expressions
- Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. 
- A single expression, commonly called a **regex**, is a string formed according to the regular expression language. 
- Python’s built-in **re module** is responsible for applying regular expressions to strings.
- The **re module** functions fall into three categories: pattern matching, substitution, and splitting. 
- A **regex** describes a pattern to locate in the text, which can then be used for many purposes.

In [85]:
# Import the re module
import re

# Define a string with variable number of whitespace characters (tabs, spaces, and newlines)
text = "foo bar\t baz \tqux"

In [87]:
# To split on runs of whitespace, combine the regex \s+ (one or more whitespace
# characters) with the re.split function; a raw string keeps the backslash literal
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

In [88]:
# If you want to get a list of all patterns matching the regex, you can use the findall method
re.findall(r'\s+', text)

[' ', '\t ', ' \t']

**Compile Regex**:
- When you call **re.split('\s+', text)**, the regular expression is first compiled, and then its split method is called on the passed text. 
- You can compile the regex yourself with **re.compile**, forming a reusable regex object.
- Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.
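A minimal sketch of compiling the whitespace pattern once and reusing it. The names `whitespace` and `sample_text` are chosen for this example only.

```python
import re

sample_text = "foo bar\t baz \tqux"

# Compile once, then reuse the regex object's methods
whitespace = re.compile(r'\s+')

pieces = whitespace.split(sample_text)
runs = whitespace.findall(sample_text)

print(pieces)  # ['foo', 'bar', 'baz', 'qux']
print(runs)    # [' ', '\t ', ' \t']
```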

In [89]:
# Let’s consider a block of text and a regular expression capable of identifying 
# most email addresses

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [90]:
# Using findall on the text produces a list of the email addresses
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [91]:
# search returns a special match object for the first email address in the text
regex.search(text)

<re.Match object; span=(5, 20), match='dave@google.com'>

In [93]:
# regex.match returns None, as it only will match if the pattern 
# occurs at the start of the string
print(regex.match(text))

None


In [94]:
# sub will return a new string with occurrences of the pattern replaced by
# a new string
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



**TABLE**: Regular expression methods

| Method                    | Description |
| :---                  |    :----    |
|findall| Return all non-overlapping matching patterns in a string as a list
|finditer| Like findall, but returns an iterator
|match| Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, returns a match object, and otherwise None
|search| Scan string for a match to pattern, returning a match object if one is found; unlike match, the match can be anywhere in the string as opposed to only at the beginning
|split| Break string into pieces at each occurrence of pattern
|sub, subn| Replace occurrences of pattern in string with replacement expression (all of them by default, or at most count if given); subn additionally returns the number of substitutions made; use symbols \1, \2, ... to refer to match group elements in the replacement string
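The group-related entries in the table can be sketched by wrapping parts of the email pattern in parentheses. The address `wesm@bright.net` and the variable names are illustrative assumptions, not part of the cells above.

```python
import re

# Same email pattern as before, with three capture groups
grouped_pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
grouped_regex = re.compile(grouped_pattern, flags=re.IGNORECASE)

# Hypothetical address, used only for illustration
m = grouped_regex.match('wesm@bright.net')
parts = m.groups()  # ('wesm', 'bright', 'net')

sample = 'Dave dave@google.com\nSteve steve@gmail.com\n'

# With groups in the pattern, findall returns a list of tuples
tuples = grouped_regex.findall(sample)

# \1, \2, \3 refer to the captured groups in the replacement string
labelled = grouped_regex.sub(r'Username: \1, Domain: \2, Suffix: \3', sample)

# subn returns the new string together with the number of substitutions
redacted, n_subs = grouped_regex.subn('REDACTED', sample)

print(parts)
print(tuples)
print(labelled)
print(n_subs)  # 2
```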