# Data Modification/Manipulation

- Data modification and manipulation refer to the process of selecting and processing the data you need.

- It converts raw data into a format that is suitable for analysis.

- It lets you reshape data to fit your needs, for example:
    - feature engineering (creating new columns)
    - data modification
    - data filtering
    - dropping data
    - replacing values
    - data sampling
    - and more
    

# Loading Libraries & Data

In [None]:
# loading libraries
import pandas as pd
import numpy as np

In [None]:
# dataframe with 10 rows and 5 columns, with row labels and column labels
# note: the row label 'f' appears twice (rows 6 and 10), so the index is not unique
df = pd.DataFrame(
    np.random.rand(10,5),
    index = list("abcdefghif"),
    columns = list("ABCDE"))

In [None]:
df.head()
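
Because the index string above repeats the label `f`, label-based lookups on that label return more than one row. The sketch below rebuilds a frame of the same shape to show the effect; `demo` and `demo_unique` are hypothetical names, and relabelling to `a` through `j` is only an assumption about what may have been intended.

```python
import numpy as np
import pandas as pd

# Same shape and labels as the frame above; "f" is repeated.
demo = pd.DataFrame(
    np.random.rand(10, 5),
    index=list("abcdefghif"),
    columns=list("ABCDE"),
)

print(demo.index.is_unique)   # reports whether every row label is distinct

# With a repeated label, .loc returns every matching row as a DataFrame.
rows_f = demo.loc["f"]
print(rows_f.shape)

# One possible fix (an assumption about the intent): ten distinct labels.
demo_unique = demo.copy()
demo_unique.index = list("abcdefghij")
print(demo_unique.index.is_unique)
```

The rest of this notebook keeps the repeated label, which is why `f` shows up twice in the outputs.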

# Create New Columns Based on Existing Columns

This is a quick way to create derived features, such as polynomial terms.

In [None]:
# set column E to 50 times A (E already exists, so this overwrites it)
df['E'] = df['A']*50

df.head()

SYNTAX: `df[new_col_name] = df['req_col_name']*operation`

This is also a form of feature engineering in ML.
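
As a minimal sketch of the polynomial-feature idea, the same column arithmetic can raise a column to a power. The frame `poly_df` and the column names `A_squared` and `A_cubed` are made up for illustration.

```python
import pandas as pd

# Small hypothetical frame with a single numeric column.
poly_df = pd.DataFrame({"A": [1.0, 2.0, 3.0]})

# Element-wise powers create polynomial terms of A.
poly_df["A_squared"] = poly_df["A"] ** 2
poly_df["A_cubed"] = poly_df["A"] ** 3

print(poly_df)
```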

In [None]:
# create a new column F as the sum of columns A, C and E
df['F'] = df['A']+df['C']+ df['E']
df.head()

In [None]:
0.158179+0.094932+7.908974 # same as column F for row a
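
Instead of checking one row by hand, the whole column can be verified at once. This is a sketch on a hypothetical frame `check_df` with random values; `np.allclose` is used rather than `==` because floating-point sums may differ in the last digits.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the three source columns.
check_df = pd.DataFrame(np.random.rand(10, 3), columns=["A", "C", "E"])
check_df["F"] = check_df["A"] + check_df["C"] + check_df["E"]

# Row-wise sum of the source columns, computed independently.
expected = check_df[["A", "C", "E"]].sum(axis=1)

# True when every value of F matches the row-wise sum within tolerance.
all_match = np.allclose(check_df["F"], expected)
print(all_match)
```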

# Modifying Existing Columns

STEPS:
- Create a criteria (a boolean mask)
- Use `loc` to verify that you are indexing into the right column (`df.loc[:, 'col_name']`)
- Apply `loc` with the criteria and the column name, and assign the replacement value (`df.loc[criteria, 'col_name'] = new_value`)

In [30]:
df.head()

Unnamed: 0,A,B,C,D,E,F
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085
b,0.81777,0.590928,0.034701,0.727542,40.8885,41.740972
c,0.172841,0.791032,0.27045,0.601542,8.642026,9.085317
d,0.277957,0.003704,0.602578,0.528036,13.897874,14.778409
e,0.794288,0.433769,0.882543,0.936211,39.714421,41.391252


In [None]:
# modify the values in column B that are greater than 0.2

# create a criteria first
criteria = df['B']>0.2

criteria

In [None]:
# create a copy of dataframe - optional
df_new = df.copy(deep = True)

df_new.head()

In [None]:
# check B column value, to make sure .loc is indexing right values
df_new.loc[:, 'B']

In [None]:
# set all values where the criteria is True to 1, keep the rest as they are

# find the rows where the criteria is True (B > 0.2)
# and set the value of column B there
df_new.loc[criteria , 'B'] = 1

In [None]:
df_new
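
The same replacement can also be written with `Series.mask`, which replaces values where a condition is True. The sketch below compares both approaches on a small hypothetical frame `small`; it is an alternative for illustration, not the method used in this notebook.

```python
import pandas as pd

# Small hypothetical frame with known values.
small = pd.DataFrame({"B": [0.05, 0.30, 0.20, 0.90]})
criteria = small["B"] > 0.2

# Approach 1: boolean mask with .loc, as in the steps above.
via_loc = small.copy()
via_loc.loc[criteria, "B"] = 1

# Approach 2: Series.mask replaces the values where criteria is True.
via_mask = small.copy()
via_mask["B"] = via_mask["B"].mask(criteria, 1)

print(via_loc)
print(via_mask)
```

Note that 0.20 is left untouched because the comparison is strictly greater than 0.2.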

In [31]:
# creating a copy

# the changes below happen in place, so work on a copy to preserve the original
city_df = df.copy()

city_df.head()

Unnamed: 0,A,B,C,D,E,F
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085
b,0.81777,0.590928,0.034701,0.727542,40.8885,41.740972
c,0.172841,0.791032,0.27045,0.601542,8.642026,9.085317
d,0.277957,0.003704,0.602578,0.528036,13.897874,14.778409
e,0.794288,0.433769,0.882543,0.936211,39.714421,41.391252


In [32]:
# list of cities to add
# the list must have as many elements as the frame has rows

cities_list = ['Mumbai', 'Delhi', 'Chennai',
              'Kolkata', 'Bengalure',
              'Pune', 'Ahmedabad', 'Patna', 'Indore', 'Coimbatore']


# add a new column, one value per row
city_df['city'] = cities_list

In [33]:
city_df.head()

Unnamed: 0,A,B,C,D,E,F,city
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085,Mumbai
b,0.81777,0.590928,0.034701,0.727542,40.8885,41.740972,Delhi
c,0.172841,0.791032,0.27045,0.601542,8.642026,9.085317,Chennai
d,0.277957,0.003704,0.602578,0.528036,13.897874,14.778409,Kolkata
e,0.794288,0.433769,0.882543,0.936211,39.714421,41.391252,Bengalure


**optional**

In [35]:
# if you have fewer values than rows, don't leave a gap: put a placeholder such as 'NA' in its place, otherwise pandas raises an error

cities_list_miss = ['Mumbai', 'Delhi', 'Chennai',
              'Kolkata', 'Bengalure', 'NA',
              'Ahmedabad', 'Patna', 'Indore', 'Coimbatore']


city_df['CITY'] = cities_list_miss


city_df

Unnamed: 0,A,B,C,D,E,F,city,CITY
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085,Mumbai,Mumbai
b,0.81777,0.590928,0.034701,0.727542,40.8885,41.740972,Delhi,Delhi
c,0.172841,0.791032,0.27045,0.601542,8.642026,9.085317,Chennai,Chennai
d,0.277957,0.003704,0.602578,0.528036,13.897874,14.778409,Kolkata,Kolkata
e,0.794288,0.433769,0.882543,0.936211,39.714421,41.391252,Bengalure,Bengalure
f,0.204819,0.857769,0.141291,0.59033,10.24096,10.587071,Pune,
g,0.688475,0.304323,0.782496,0.340568,34.423733,35.894703,Ahmedabad,Ahmedabad
h,0.941735,0.297764,0.206632,0.503906,47.086773,48.235141,Patna,Patna
i,0.633427,0.343544,0.004032,0.919857,31.671368,32.308828,Indore,Indore
f,0.243107,0.960019,0.512547,0.538527,12.155359,12.911014,Coimbatore,Coimbatore


NOTES:
- The `'NA'` you provided is an ordinary string, not a missing value, so it is not counted by `df.isna().sum()`

- The values in the new column appear in the same sequence as in the list, so order matters

In [36]:
city_df['CITY'].isna().sum()

0
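
The two points above can be checked on a small hypothetical frame: a list that is too short is rejected with a `ValueError`, and the string `'NA'` is not treated as missing while `np.nan` is. The names `frame`, `as_text` and `as_missing` are made up for this sketch.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [1, 2, 3]})

# A list shorter than the number of rows is rejected.
try:
    frame["city"] = ["Mumbai", "Delhi"]
    length_error = None
except ValueError as exc:
    length_error = exc
print(type(length_error).__name__)

# "NA" is plain text; np.nan is a real missing value.
frame["as_text"] = ["Mumbai", "NA", "Delhi"]
frame["as_missing"] = ["Mumbai", np.nan, "Delhi"]

text_count = frame["as_text"].isna().sum()
missing_count = frame["as_missing"].isna().sum()
print(text_count, missing_count)
```

If the placeholder should count as missing, using `np.nan` in the list instead of the string is one option.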

# Filter Categorical Columns

* So far the criteria were set on numerical columns.
* What if the column values are categorical?

STEPS
* Create a criteria using the `isin` method, which returns a Series of bools.
* Apply it to the DataFrame.

In [40]:
# create a criteria
criteria = city_df['CITY'].isin(['NA'])
criteria

a    False
b    False
c    False
d    False
e    False
f     True
g    False
h    False
i    False
f    False
Name: CITY, dtype: bool

SYNTAX: `df[col_name].isin([elements])`

In [41]:
# apply on data frame using loc
city_df.loc[criteria, :]

Unnamed: 0,A,B,C,D,E,F,city,CITY
f,0.204819,0.857769,0.141291,0.59033,10.24096,10.587071,Pune,


In [44]:
# working with multiple columns using an AND condition - create several criteria and combine them in loc

# only select rows where CITY is Mumbai, Bengalure or Chennai and city is Mumbai or Chennai

criteria_1 = city_df['CITY'].isin(['Mumbai', 'Bengalure', 'Chennai'])
criteria_2 = city_df['city'].isin(['Mumbai', 'Chennai'])

city_df.loc[criteria_1 & criteria_2, :]

Unnamed: 0,A,B,C,D,E,F,city,CITY
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085,Mumbai,Mumbai
c,0.172841,0.791032,0.27045,0.601542,8.642026,9.085317,Chennai,Chennai
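
Criteria can also be combined with `|` (OR) and negated with `~` (NOT). The sketch below uses a small hypothetical frame `towns`; the parentheses around each comparison are needed because `&` and `|` bind more tightly than `<` and `>`.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column.
towns = pd.DataFrame(
    {"city": ["Mumbai", "Delhi", "Chennai", "Pune"],
     "score": [0.9, 0.4, 0.7, 0.2]},
    index=list("abcd"),
)

in_list = towns["city"].isin(["Mumbai", "Chennai"])

# AND: city is in the list and score is above 0.5
both = towns.loc[in_list & (towns["score"] > 0.5), :]

# OR: city is in the list or score is below 0.3
either = towns.loc[in_list | (towns["score"] < 0.3), :]

# NOT: city is not in the list
not_listed = towns.loc[~in_list, :]

print(both, either, not_listed, sep="\n\n")
```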


# Drop Unnecessary Columns

Often we want to get rid of unnecessary columns, so we drop them.

In [46]:
# create a new data frame for safety
drop_df = city_df.copy()
drop_df

Unnamed: 0,A,B,C,D,E,F,city,CITY
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085,Mumbai,Mumbai
b,0.81777,0.590928,0.034701,0.727542,40.8885,41.740972,Delhi,Delhi
c,0.172841,0.791032,0.27045,0.601542,8.642026,9.085317,Chennai,Chennai
d,0.277957,0.003704,0.602578,0.528036,13.897874,14.778409,Kolkata,Kolkata
e,0.794288,0.433769,0.882543,0.936211,39.714421,41.391252,Bengalure,Bengalure
f,0.204819,0.857769,0.141291,0.59033,10.24096,10.587071,Pune,
g,0.688475,0.304323,0.782496,0.340568,34.423733,35.894703,Ahmedabad,Ahmedabad
h,0.941735,0.297764,0.206632,0.503906,47.086773,48.235141,Patna,Patna
i,0.633427,0.343544,0.004032,0.919857,31.671368,32.308828,Indore,Indore
f,0.243107,0.960019,0.512547,0.538527,12.155359,12.911014,Coimbatore,Coimbatore


In [47]:
# drop the city Column

# less explicit: returns a new frame, so the result must be reassigned
# drop_df.drop(['city'], axis = 1)

# more explicit, and modifies drop_df in place
drop_df.drop(columns = ['city'], inplace = True)

In [48]:
drop_df

Unnamed: 0,A,B,C,D,E,F,CITY
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085,Mumbai
b,0.81777,0.590928,0.034701,0.727542,40.8885,41.740972,Delhi
c,0.172841,0.791032,0.27045,0.601542,8.642026,9.085317,Chennai
d,0.277957,0.003704,0.602578,0.528036,13.897874,14.778409,Kolkata
e,0.794288,0.433769,0.882543,0.936211,39.714421,41.391252,Bengalure
f,0.204819,0.857769,0.141291,0.59033,10.24096,10.587071,
g,0.688475,0.304323,0.782496,0.340568,34.423733,35.894703,Ahmedabad
h,0.941735,0.297764,0.206632,0.503906,47.086773,48.235141,Patna
i,0.633427,0.343544,0.004032,0.919857,31.671368,32.308828,Indore
f,0.243107,0.960019,0.512547,0.538527,12.155359,12.911014,Coimbatore


WHAT'S THE DIFFERENCE:

`drop_df.drop(['city'], axis = 1)`

- `axis = 1` specifies that the labels refer to columns.

- By default `drop` returns a new DataFrame without the column, so you need to reassign the result to a variable to see the effect.

`drop_df.drop(columns = ['city'], inplace = True)`

- `columns =` is a more natural and explicit way of saying "drop by column name".

- Reassignment is not needed because `inplace = True` is provided.
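
The difference can be seen on a small hypothetical frame `base`: without `inplace = True` the original keeps its column, and with it the original is changed.

```python
import pandas as pd

base = pd.DataFrame({"A": [1, 2], "city": ["Mumbai", "Delhi"]})

# Without inplace, drop returns a new frame and leaves base untouched.
returned = base.drop(["city"], axis=1)
still_there = "city" in base.columns
print(still_there)

# With inplace=True, base itself loses the column and nothing is returned.
base.drop(columns=["city"], inplace=True)
gone = "city" not in base.columns
print(gone)
```

Note that `axis = 1` versus `columns =` and reassignment versus `inplace = True` are independent choices; `base = base.drop(columns=["city"])` is equally valid.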

# Replacing Specific Value

- Data often contains wrong values, for example because of data entry mistakes.

- Those values need to be fixed.
- This is very important for making the data useful.

In [49]:
# work on a copy for safety
replace_df = city_df.copy()

replace_df.head(8)

Unnamed: 0,A,B,C,D,E,F,city,CITY
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085,Mumbai,Mumbai
b,0.81777,0.590928,0.034701,0.727542,40.8885,41.740972,Delhi,Delhi
c,0.172841,0.791032,0.27045,0.601542,8.642026,9.085317,Chennai,Chennai
d,0.277957,0.003704,0.602578,0.528036,13.897874,14.778409,Kolkata,Kolkata
e,0.794288,0.433769,0.882543,0.936211,39.714421,41.391252,Bengalure,Bengalure
f,0.204819,0.857769,0.141291,0.59033,10.24096,10.587071,Pune,
g,0.688475,0.304323,0.782496,0.340568,34.423733,35.894703,Ahmedabad,Ahmedabad
h,0.941735,0.297764,0.206632,0.503906,47.086773,48.235141,Patna,Patna


In [50]:
# change CITY's NA value to Pune
replace_df.loc[replace_df['CITY'] == 'NA', ['CITY']] = 'Pune'

UNDERSTAND WHAT'S HAPPENING:

1. `replace_df.loc[rows, columns]` - Label-based indexer that selects the given rows and columns

2. `replace_df['CITY'] == 'NA'` - Checks which rows of the CITY column hold the value NA. Returns a boolean Series

3. `['CITY'] = 'Pune'` - Selects the CITY column and assigns Pune to the rows where the Series is True

SYNTAX:
```
df.loc[df[col_name] == search_value, [col_names]] = value
```
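
For a simple value-for-value swap, `Series.replace` gives the same result as the `loc` pattern. The sketch below compares the two on a hypothetical frame `fix_df`; it is shown as an alternative, not as the approach this notebook uses.

```python
import pandas as pd

fix_df = pd.DataFrame({"CITY": ["Mumbai", "NA", "Delhi"]})

# Approach 1: boolean mask with .loc, following the syntax above.
via_loc = fix_df.copy()
via_loc.loc[via_loc["CITY"] == "NA", ["CITY"]] = "Pune"

# Approach 2: Series.replace swaps every exact match of "NA" for "Pune".
via_replace = fix_df.copy()
via_replace["CITY"] = via_replace["CITY"].replace("NA", "Pune")

print(via_loc)
print(via_replace)
```

The `loc` form is more general, since the mask can come from any condition, including conditions on other columns.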

In [51]:
replace_df

Unnamed: 0,A,B,C,D,E,F,city,CITY
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085,Mumbai,Mumbai
b,0.81777,0.590928,0.034701,0.727542,40.8885,41.740972,Delhi,Delhi
c,0.172841,0.791032,0.27045,0.601542,8.642026,9.085317,Chennai,Chennai
d,0.277957,0.003704,0.602578,0.528036,13.897874,14.778409,Kolkata,Kolkata
e,0.794288,0.433769,0.882543,0.936211,39.714421,41.391252,Bengalure,Bengalure
f,0.204819,0.857769,0.141291,0.59033,10.24096,10.587071,Pune,Pune
g,0.688475,0.304323,0.782496,0.340568,34.423733,35.894703,Ahmedabad,Ahmedabad
h,0.941735,0.297764,0.206632,0.503906,47.086773,48.235141,Patna,Patna
i,0.633427,0.343544,0.004032,0.919857,31.671368,32.308828,Indore,Indore
f,0.243107,0.960019,0.512547,0.538527,12.155359,12.911014,Coimbatore,Coimbatore


# Replace All Rows With Same Value

* Sometimes you may need to set every entry of a column to the same value

* A simple way to do this is with `loc`

In [52]:
change_df = city_df.copy()
change_df.head()

Unnamed: 0,A,B,C,D,E,F,city,CITY
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085,Mumbai,Mumbai
b,0.81777,0.590928,0.034701,0.727542,40.8885,41.740972,Delhi,Delhi
c,0.172841,0.791032,0.27045,0.601542,8.642026,9.085317,Chennai,Chennai
d,0.277957,0.003704,0.602578,0.528036,13.897874,14.778409,Kolkata,Kolkata
e,0.794288,0.433769,0.882543,0.936211,39.714421,41.391252,Bengalure,Bengalure


In [53]:
# create a new column
change_df['STATE'] = ['Maharashtra', 'Delhi', 'Tamil Nadu',
                     'West Bengal', 'Karnataka', 'Maharashtra',
                     'Gujarat', 'Bihar', 'Madhya Pradesh', 'Tamil Nadu']
change_df.head()

Unnamed: 0,A,B,C,D,E,F,city,CITY,STATE
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085,Mumbai,Mumbai,Maharashtra
b,0.81777,0.590928,0.034701,0.727542,40.8885,41.740972,Delhi,Delhi,Delhi
c,0.172841,0.791032,0.27045,0.601542,8.642026,9.085317,Chennai,Chennai,Tamil Nadu
d,0.277957,0.003704,0.602578,0.528036,13.897874,14.778409,Kolkata,Kolkata,West Bengal
e,0.794288,0.433769,0.882543,0.936211,39.714421,41.391252,Bengalure,Bengalure,Karnataka


Now, for some reason, I want to change every value in the STATE column to Gujarat.

Here are the steps:
- Use `loc` to select all rows and the specific column
- Apply the assignment operator to it with the desired value

SYNTAX: `df.loc[:, 'col_name'] = replace_value`

In [54]:
# for all rows of STATE, change the value to Gujarat
change_df.loc[:, 'STATE'] = 'Gujarat'

In [55]:
change_df.head()

Unnamed: 0,A,B,C,D,E,F,city,CITY,STATE
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085,Mumbai,Mumbai,Gujarat
b,0.81777,0.590928,0.034701,0.727542,40.8885,41.740972,Delhi,Delhi,Gujarat
c,0.172841,0.791032,0.27045,0.601542,8.642026,9.085317,Chennai,Chennai,Gujarat
d,0.277957,0.003704,0.602578,0.528036,13.897874,14.778409,Kolkata,Kolkata,Gujarat
e,0.794288,0.433769,0.882543,0.936211,39.714421,41.391252,Bengalure,Bengalure,Gujarat


In [56]:
# STATE is no longer needed, so drop it
change_df.drop(columns = ['STATE'], inplace = True)

In [57]:
change_df

Unnamed: 0,A,B,C,D,E,F,city,CITY
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085,Mumbai,Mumbai
b,0.81777,0.590928,0.034701,0.727542,40.8885,41.740972,Delhi,Delhi
c,0.172841,0.791032,0.27045,0.601542,8.642026,9.085317,Chennai,Chennai
d,0.277957,0.003704,0.602578,0.528036,13.897874,14.778409,Kolkata,Kolkata
e,0.794288,0.433769,0.882543,0.936211,39.714421,41.391252,Bengalure,Bengalure
f,0.204819,0.857769,0.141291,0.59033,10.24096,10.587071,Pune,
g,0.688475,0.304323,0.782496,0.340568,34.423733,35.894703,Ahmedabad,Ahmedabad
h,0.941735,0.297764,0.206632,0.503906,47.086773,48.235141,Patna,Patna
i,0.633427,0.343544,0.004032,0.919857,31.671368,32.308828,Indore,Indore
f,0.243107,0.960019,0.512547,0.538527,12.155359,12.911014,Coimbatore,Coimbatore


# Sample Data

* Often you need to work on a sample of the data and form hypotheses about the population.

* Finally, check whether your hypothesis holds on the population data.

* It's a quick way to understand the nature of the data.

* Sampling also introduces randomness, which helps keep a model from memorizing the data (overfitting).

* Use the `df.sample` method.

In [58]:
sample_df = city_df.copy()
sample_df.head()

Unnamed: 0,A,B,C,D,E,F,city,CITY
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085,Mumbai,Mumbai
b,0.81777,0.590928,0.034701,0.727542,40.8885,41.740972,Delhi,Delhi
c,0.172841,0.791032,0.27045,0.601542,8.642026,9.085317,Chennai,Chennai
d,0.277957,0.003704,0.602578,0.528036,13.897874,14.778409,Kolkata,Kolkata
e,0.794288,0.433769,0.882543,0.936211,39.714421,41.391252,Bengalure,Bengalure


In [60]:
help(df.sample) # for documentation

Help on method sample in module pandas.core.generic:

sample(n: 'int | None' = None, frac: 'float | None' = None, replace: 'bool_t' = False, weights=None, random_state: 'RandomState | None' = None, axis: 'Axis | None' = None, ignore_index: 'bool_t' = False) -> 'Self' method of pandas.core.frame.DataFrame instance
    Return a random sample of items from an axis of object.
    
    You can use `random_state` for reproducibility.
    
    Parameters
    ----------
    n : int, optional
        Number of items from axis to return. Cannot be used with `frac`.
        Default = 1 if `frac` = None.
    frac : float, optional
        Fraction of axis items to return. Cannot be used with `n`.
    replace : bool, default False
        Allow or disallow sampling of the same row more than once.
    weights : str or ndarray-like, optional
        Default 'None' results in equal probability weighting.
        If passed a Series, will align with target object on index. Index
        values in weight

In [63]:
# take a random sample of 3 observations
sample_df.sample(n = 3)

Unnamed: 0,A,B,C,D,E,F,city,CITY
c,0.172841,0.791032,0.27045,0.601542,8.642026,9.085317,Chennai,Chennai
f,0.204819,0.857769,0.141291,0.59033,10.24096,10.587071,Pune,
e,0.794288,0.433769,0.882543,0.936211,39.714421,41.391252,Bengalure,Bengalure


In [64]:
# adding seed for reproducibility
sample_df.sample(n = 5, random_state = 42)

Unnamed: 0,A,B,C,D,E,F,city,CITY
i,0.633427,0.343544,0.004032,0.919857,31.671368,32.308828,Indore,Indore
b,0.81777,0.590928,0.034701,0.727542,40.8885,41.740972,Delhi,Delhi
f,0.204819,0.857769,0.141291,0.59033,10.24096,10.587071,Pune,
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085,Mumbai,Mumbai
h,0.941735,0.297764,0.206632,0.503906,47.086773,48.235141,Patna,Patna


In [65]:
# running it again with the same seed returns the same rows
sample_df.sample(n = 5, random_state = 42)

Unnamed: 0,A,B,C,D,E,F,city,CITY
i,0.633427,0.343544,0.004032,0.919857,31.671368,32.308828,Indore,Indore
b,0.81777,0.590928,0.034701,0.727542,40.8885,41.740972,Delhi,Delhi
f,0.204819,0.857769,0.141291,0.59033,10.24096,10.587071,Pune,
a,0.158179,0.203376,0.094932,0.833074,7.908974,8.162085,Mumbai,Mumbai
h,0.941735,0.297764,0.206632,0.503906,47.086773,48.235141,Patna,Patna


In [66]:
# sampling with replacement - the same row can appear more than once in the sample
sample_df.sample(n = 5, random_state = 42, replace = True)

# replace = False , by default

Unnamed: 0,A,B,C,D,E,F,city,CITY
g,0.688475,0.304323,0.782496,0.340568,34.423733,35.894703,Ahmedabad,Ahmedabad
d,0.277957,0.003704,0.602578,0.528036,13.897874,14.778409,Kolkata,Kolkata
h,0.941735,0.297764,0.206632,0.503906,47.086773,48.235141,Patna,Patna
e,0.794288,0.433769,0.882543,0.936211,39.714421,41.391252,Bengalure,Bengalure
g,0.688475,0.304323,0.782496,0.340568,34.423733,35.894703,Ahmedabad,Ahmedabad
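
The `frac` parameter listed in the help text takes a fraction of the rows instead of a fixed count. The sketch below uses a hypothetical ten-row frame `pop`; `frac = 1` returns every row in random order, which is a common way to shuffle a frame.

```python
import pandas as pd

# Hypothetical ten-row frame.
pop = pd.DataFrame({"value": range(10)})

# 30% of the rows; with ten rows this gives three.
third = pop.sample(frac=0.3, random_state=0)

# All rows in random order (a shuffle).
shuffled = pop.sample(frac=1, random_state=0)

# The same seed gives the same sample again.
again = pop.sample(frac=0.3, random_state=0)

print(third)
print(shuffled)
```

Remember that `n` and `frac` cannot be used together, as the help text states.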
