In [None]:
import pandas as pd
import warnings
import numpy as np

warnings.filterwarnings('ignore')

### Performing transformations on `survey_data`

#### Step 1. Importing data into environment.

In [None]:
survey_data = pd.read_csv(r"https://raw.githubusercontent.com/puneettrainer/datasets/main/survey_data.csv")

#### Step 2. Understanding the schema of the data.

This includes:
- understanding the number of rows and columns in the dataset
- understanding the data type of each field
- checking for null/non-standard values

In [None]:
survey_data.info()

Based on the `info()` method, there are 30 observations/rows (`entries`) and 13 columns in this dataset.

The index for this dataframe is the default index provided by pandas.

This data has:
- 2 `float64` columns: `Professional Experience` and `Notice Period.1`
- 2 `int64` columns: `Age` and `Response ID`
- remaining 9 columns are `object` type.
- 1 null value in each of the `Education`, `Profession`, `Last Active` and `Notice Period.1` fields
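
The same facts can also be computed instead of being read off the `info()` output. The following is a minimal sketch on a small made-up dataframe; `demo` and its values are illustrative assumptions, not the survey data:

```python
import numpy as np
import pandas as pd

# hypothetical three-row frame standing in for survey_data
demo = pd.DataFrame({
    "Age": [25, 31, 28],
    "Professional Experience": [1.5, np.nan, 4.0],
    "Education": ["B.Sc.", None, "MBA"],
})

n_rows, n_cols = demo.shape                         # (rows, columns)
null_counts = demo.isna().sum()                     # number of nulls in each column
n_numeric = demo.select_dtypes("number").shape[1]   # how many columns are numeric

print(n_rows, n_cols)
print(null_counts)
print(n_numeric)
```

Replacing `demo` with `survey_data` should give the row, column and null counts described above.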

#### Another way to get the number of null values in each field in the dataframe:

`dataframe_object.isna().sum()`

In [None]:
survey_data.isna().sum()

#### Step 3. Getting a preview of the data.

For this step we can use the `head(n)`, `tail(n)` or `sample(n)` methods. `head(n)` returns the first `n` rows, `tail(n)` returns the last `n` rows and `sample(n)` returns `n` randomly selected rows. The objective is to get an idea of how the data is recorded in the dataframe.

In [None]:
survey_data.sample(10)
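
To make the difference between the three methods concrete, here is a small sketch on a made-up dataframe (`demo` is an assumption for illustration). Passing `random_state` to `sample()` makes the random selection repeatable:

```python
import pandas as pd

# hypothetical frame with an easily recognisable row order
demo = pd.DataFrame({"Response ID": range(1, 11)})

first_three = demo.head(3)                      # first 3 rows
last_three = demo.tail(3)                       # last 3 rows
random_three = demo.sample(3, random_state=0)   # 3 random rows, same rows on every run

print(first_three["Response ID"].tolist())
print(last_three["Response ID"].tolist())
print(random_three["Response ID"].tolist())
```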

Based on the data available in the dataframe:<br>
`Age` - stores the age of the candidate as a whole number<br>
`Gender`- stores the gender of the candidate as `Male` or `Female`<br>
`Date of Joining` - stores the date when the candidate joined TrainingYA<br>
`Courses Enrolled` - stores the number of courses enrolled in as an integer value<br>
`Courses Completed`- stores the number of courses completed as an integer value
<br>`Professional Experience` - stores the total number of years of professional experience
<br>`Education` - stores the general name of the highest education completed
<br>`Profession` - specifies whether the candidate is `Working` or a `Student`
<br>`Last Active` - stores the date when the candidate was last employed
<br>`Notice Period` - indicates whether the candidate is serving notice period or not
<br>`Notice Period.1` - stores the duration of notice period in terms of days
<br>`Industry` - stores the general name of the industry in which the candidate is employed in currently
<br>`Response ID` - unique identifier assigned to each candidate

Based on this preview, we need to:
- convert:
    - `Date of Joining` from string object to a date type field
    - `Courses Enrolled` from string object to a whole number type field
    - `Courses Completed` from string object to a whole number type field
    - `Last Active` from string object to a date type field
    - `Notice Period` from a string object to a boolean field
- rename `Notice Period.1` to an understandable name

#### `dataframe_object.fillna(rep)`

The `fillna(rep)` method is used to replace NaN, NaT, etc. values with `rep`. This is useful when we are performing transformations on the data.

In [None]:
# replacing nan value with NA
survey_data.fillna('NA', inplace=True)
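
One side effect worth keeping in mind: filling a numeric column with a string changes that column's type to `object`. The sketch below uses a made-up dataframe (`demo` is an assumption) to compare filling every column with filling only a chosen column by passing a dictionary:

```python
import numpy as np
import pandas as pd

# hypothetical frame with one numeric and one text column
demo = pd.DataFrame({
    "Notice Period.1": [30.0, np.nan, 60.0],
    "Education": ["B.A", None, "MBA"],
})

# filling every column with the same string
filled_all = demo.fillna("NA")

# filling only the text column leaves the numeric column untouched
filled_text_only = demo.fillna({"Education": "NA"})

print(filled_all.dtypes)
print(filled_text_only.dtypes)
```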

In [None]:
# checking whether Gender column contains any value other than Male or Female
survey_data['Gender'].unique()

In [None]:
# checking whether the Courses Enrolled column contains any non-numeric value
survey_data.loc[survey_data['Courses Enrolled'].str.contains('[^0-9]+', regex=True), :]

In [None]:
# assigning null where invalid value in Courses Enrolled
survey_data.loc[survey_data['Courses Enrolled'].str.contains('[^0-9]+', regex=True), 'Courses Enrolled'] = -1
survey_data['Courses Enrolled'] = survey_data['Courses Enrolled'].astype('int')
survey_data.loc[survey_data['Courses Enrolled'] == -1, 'Courses Enrolled'] = np.nan

In [None]:
# checking whether the Courses Completed column contains any non-numeric value
survey_data[survey_data['Courses Completed'].str.contains('[^0-9]+', regex=True)]

In [None]:
# assigning null where invalid value in Courses Completed
survey_data.loc[survey_data['Courses Completed'].str.contains('[^0-9]+', regex=True), 'Courses Completed'] = -1
survey_data['Courses Completed'] = survey_data['Courses Completed'].astype('int')
survey_data.loc[survey_data['Courses Completed'] == -1, 'Courses Completed'] = np.nan
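
An alternative to the `-1` placeholder approach above is `pd.to_numeric()` with `errors="coerce"`, which converts anything it cannot parse into `NaN` in one step. This is a sketch on made-up values (`courses` is an assumption), not a replacement for the cells above. Note that it also accepts decimals such as `"3.5"`, which the digits-only pattern would reject:

```python
import pandas as pd

# hypothetical values mimicking a text column that should hold counts
courses = pd.Series(["3", "12", "two", "NA", "5"])

# values that cannot be parsed as numbers become NaN
courses_numeric = pd.to_numeric(courses, errors="coerce")

print(courses_numeric)
```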

In [None]:
# looking at distinct values in Education
survey_data['Education'].unique()

We need to standardize values in the Education column as follows:

| Value | Standard Value |
| --- | --- |
| Master of Science (M.Sc.) | M.Sc. |
| M.Sc Data Science & Artificial Intelligence | M.Sc. |
| Bachelor Of Science | B.Sc. |
| Bachelor of science | B.Sc. |
| Bachelor's in science | B.Sc. |
| bachelors in science | B.Sc. |
| M.Tech | M.Tech. |
| B. Tech (ME) | B.Tech. |
| Btech.| B.Tech. |
| Btech(cse) | B.Tech. |
| Non-Technical (Masters) | Invalid |
| B.A | B.A. |
| Bachelor of commerce | B.Com. |
| post graduation | Invalid |
| Graduate | Invalid |
| Data Science | Invalid |
| nan | Invalid |
| Graduation | Invalid |
| MBA | M.B.A. |
| BCA | B.C.A. |

Based on the above table:
- if Education value contains pattern like m____sc___, then `M.Sc.`
- if Education value contains pattern like b_____sc____, then `B.Sc.`
- if Education value contains pattern like b____a___, then `B.A.`
- if Education value contains pattern like b_____tech____, then `B.Tech.`
- if Education value contains pattern like m____tech____, then `M.Tech.`
- if Education value contains pattern like b____com____, then `B.Com.`
- if Education = 'MBA' then `M.B.A.`
- if Education = 'BCA' then `B.C.A.`
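
The rules above can be sketched as a single function that checks the patterns in order. The names `EDUCATION_RULES` and `standardize_education` are hypothetical, and the `B.A.` rule is deliberately tightened to an exact match on `ba` (an assumption) so that it cannot catch other values starting with `b`:

```python
import re

# ordered (pattern, label) pairs; the first pattern that matches wins
EDUCATION_RULES = [
    (r"^mba$", "M.B.A."),
    (r"^bca$", "B.C.A."),
    (r"m[a-z]*sc", "M.Sc."),
    (r"b[a-z]*sc", "B.Sc."),
    (r"b[a-z]*com", "B.Com."),
    (r"b[a-z]*tech", "B.Tech."),
    (r"m[a-z]*tech", "M.Tech."),
    (r"^ba$", "B.A."),
]

def standardize_education(raw):
    """Return the standard label for a raw Education value, or 'Invalid'."""
    # lowercase and drop everything that is not a letter, e.g. 'B. Tech (ME)' -> 'btechme'
    cleaned = re.sub(r"[^a-z]+", "", str(raw).lower())
    for pattern, label in EDUCATION_RULES:
        if re.search(pattern, cleaned):
            return label
    return "Invalid"

for value in ["Master of Science (M.Sc.)", "B. Tech (ME)", "B.A", "Graduate"]:
    print(value, "->", standardize_education(value))
```

With pandas, a function like this could be applied with `survey_data['Education'].map(standardize_education)`.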

In [None]:
# cleaning up Education column

# converting to lowercase and removing non-alphabetic characters (including spaces)
survey_data['Education'] = survey_data['Education'].str.lower().str.replace('[^a-z]+', '', regex=True).str.strip()

survey_data.loc[survey_data['Education'] == 'mba', 'Education'] = 'M.B.A.'
survey_data.loc[survey_data['Education'] == 'bca', 'Education'] = 'B.C.A.'

In [None]:
survey_data.loc[:, 'Education']

In [None]:
survey_data.loc[survey_data['Education'].str.contains('m[a-z]*sc[a-z]*', regex=True), 'Education']

In [None]:
# assigning correct value to Education column
survey_data.loc[survey_data['Education'].str.contains('m[a-z]*sc[a-z]*', regex=True), 'Education'] = 'M.Sc.'

In [None]:
survey_data.loc[survey_data['Education'].str.contains('b[a-z]*sc[a-z]*', regex=True), 'Education']

In [None]:
survey_data.loc[survey_data['Education'].str.contains('b[a-z]*sc[a-z]*', regex=True), 'Education'] = 'B.Sc.'

In [None]:
survey_data.loc[survey_data['Education'].str.contains('b[a-z]*com[a-z]*', regex=True), 'Education']

In [None]:
survey_data.loc[survey_data['Education'].str.contains('b[a-z]*com[a-z]*', regex=True), 'Education'] = 'B.Com.'

In [None]:
survey_data.loc[survey_data['Education'].str.contains('b[a-z]*a[a-z]*', regex=True), 'Education']

In [None]:
survey_data.loc[survey_data['Education'].str.contains('b[a-z]*a[a-z]*', regex=True), 'Education'] = 'B.A.'

In [None]:
survey_data.loc[survey_data['Education'].str.contains('b[a-z]*tech[a-z]*', regex=True), 'Education']

In [None]:
survey_data.loc[survey_data['Education'].str.contains('b[a-z]*tech[a-z]*', regex=True), 'Education'] = 'B.Tech.'

In [None]:
survey_data.loc[survey_data['Education'].str.contains('m[a-z]*tech[a-z]*', regex=True), 'Education'] = 'M.Tech.'

In [None]:
survey_data.loc[:, 'Education']

In [None]:
# every standard value from the table above, including B.Com.
valid_edu = ['M.Tech.', 'B.Tech.', 'B.A.', 'M.B.A.', 'B.C.A.', 'M.Sc.', 'B.Sc.', 'B.Com.']
survey_data.loc[~survey_data['Education'].isin(valid_edu), 'Education'] = ''

In [None]:
survey_data.loc[:, 'Education']

In [None]:
# looking at distinct values in Profession
survey_data['Profession'].unique()

In [None]:
# looking at distinct values in Industry
survey_data['Industry'].unique()

Cleansing for the `Industry` column

| Value | Standard Value |
| --- | --- |
| BFSI | |
| Computer Application | Invalid |
| Healthcare | |
| Biotechnology Research | |
| BSC | Invalid |
| education | Education |
| DATA SCIENCE 360 | Invalid |
| Teaching | Education |
| BCA (computer application) | Invalid |
| STUDENT | Invalid |
| PHARMACEUTICAL | Pharmaceutical |
| HealthCare | Healthcare |
| Supply Chain management | Supply Chain Management |
| data science | Invalid |
| 360 programme | Invalid |
| Naval Architecture | |
| telecom | Telecommunications |
| Data Science 360 | Invalid |
| Insurance Domain and Finance | |
| BPO | |
| Bcom hons | Invalid |
| Data Science | Invalid |
| FMCG | |
| Data science | Invalid |
| B.com(hon) | Invalid |
| Applied Physics | Invalid |

In [None]:
# cleaning up Industry column

# converting to lowercase and removing non-alphabetic characters (including spaces)
survey_data['Industry'] = survey_data['Industry'].str.lower().str.replace('[^a-z]+', '', regex=True).str.strip()

survey_data

In [None]:
survey_data

In the `Date of Joining` field, we can see there are two ways in which the date is provided:
- DD-MM-YYYY
- MM/DD/YYYY (slash-separated values are parsed with `%m/%d/%Y` in the code that follows)

There is no direct way to list the formats in which dates have been entered in a column. When developing an ETL framework, we add our own checks: match each value against the formats we know about, parse it with the corresponding format string, and treat anything unrecognized as missing.

In [None]:
import datetime as dt
import re

for row in survey_data.index:
    value = survey_data.loc[row, 'Date of Joining']

    # hyphen-separated values are read as day-month-year
    if re.fullmatch(r'[0-9]{2}-[0-9]{2}-[0-9]{4}', value):
        survey_data.loc[row, 'Date of Joining'] = dt.datetime.strptime(value, '%d-%m-%Y')

    # slash-separated values are read as month/day/year
    elif re.fullmatch(r'[0-9]{2}/[0-9]{2}/[0-9]{4}', value):
        survey_data.loc[row, 'Date of Joining'] = dt.datetime.strptime(value, '%m/%d/%Y')

    # anything else (including the 'NA' placeholder) becomes null
    else:
        survey_data.loc[row, 'Date of Joining'] = np.nan
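
The same idea can be written as a small helper that tries each known format in turn. This is a sketch: `DATE_FORMATS` and `parse_date` are hypothetical names, and the format list is an assumption that mirrors the two cases handled in the loop above:

```python
import datetime as dt

# candidate formats, tried in order; extend this list to cover
# whatever formats actually occur in the data
DATE_FORMATS = ["%d-%m-%Y", "%m/%d/%Y"]

def parse_date(text):
    """Return a datetime for the first format that fits, else None."""
    for fmt in DATE_FORMATS:
        try:
            return dt.datetime.strptime(str(text).strip(), fmt)
        except ValueError:
            # this format did not fit, try the next one
            continue
    return None

print(parse_date("25-12-2021"))
print(parse_date("12/25/2021"))
print(parse_date("NA"))
```

Because `strptime()` validates the calendar date as well as the shape, an impossible date such as `31-02-2021` also comes back as `None` instead of raising an error. The helper could be applied to a column with `survey_data['Date of Joining'].map(parse_date)`.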

### Regex Reference

A regular expression (regex) is a pattern-matching syntax that lets us find and manipulate strings based on what they "look" like. Matching is case-sensitive by default, so it helps to normalize the text first (for example, by converting it to upper or lower case).

#### Syntax:

| Symbol | Indication |
| --- | --- |
| [] | defines a character set; matches any one character listed inside the brackets |
| () | groups part of a pattern so that it can be treated as a single unit |
| ? | makes the preceding character or group optional (zero or one occurrence) |
| {n} | the preceding character or group must appear exactly n times in a row |
| * | the preceding character or group may appear zero or more times |
| + | the preceding character or group must appear one or more times |
| A-Z | inside [], any uppercase letter |
| a-z | inside [], any lowercase letter |
| 0-9 | inside [], any digit |
| ^ | as the first character inside [], matches any character not in the set; outside [], anchors the match to the start of the string |

[Python Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html#regex-howto)
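
A few of these symbols in action, using Python's built-in `re` module. The sample strings are made up for illustration:

```python
import re

# [0-9]{2} means exactly two digits, so the pattern has a DD-MM-YYYY shape
date_like = re.compile(r"[0-9]{2}-[0-9]{2}-[0-9]{4}")
two_digit_day = bool(date_like.fullmatch("05-11-2021"))   # True
one_digit_day = bool(date_like.fullmatch("5-11-2021"))    # False

# ? makes the preceding element (here an escaped dot) optional
with_dot = bool(re.fullmatch(r"b\.?tech", "b.tech"))      # True
without_dot = bool(re.fullmatch(r"b\.?tech", "btech"))    # True

# ^ inside brackets negates the set: remove everything except lowercase letters
letters_only = re.sub(r"[^a-z]+", "", "b. tech (me)")     # 'btechme'

# * allows zero or more characters, + requires at least one
star_match = bool(re.fullmatch(r"m[a-z]*sc", "msc"))      # True
plus_match = bool(re.fullmatch(r"m[a-z]+sc", "msc"))      # False

print(two_digit_day, one_digit_day, with_dot, without_dot)
print(letters_only, star_match, plus_match)
```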

In [None]:
survey_data['Date of Joining'] = pd.to_datetime(survey_data['Date of Joining'])
survey_data['Date of Joining']

In [None]:
# when the dates in a column are in the same format, we can directly use pd.to_datetime() to convert the entire column into a standard date column
survey_data['Last Active'] = pd.to_datetime(survey_data['Last Active'], format='%d-%m-%Y', errors='coerce')
survey_data['Last Active']

In [None]:
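survey_data['Notice Period'] = survey_data['Notice Period'].map({'Yes':True
                                                                ,'No':False})
# values other than Yes/No become missing; the nullable 'boolean' type keeps them as <NA>
# astype() returns a new Series, so the result has to be assigned back
survey_data['Notice Period'] = survey_data['Notice Period'].astype('boolean')
survey_data.info()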
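
When a mapped column contains missing values, the choice of target type matters. The sketch below, on a made-up Series (`notice` is an assumption), compares the plain `bool` type with pandas' nullable `boolean` type:

```python
import pandas as pd

# hypothetical answers, including the 'NA' placeholder
notice = pd.Series(["Yes", "No", "NA", "Yes"])

# 'NA' has no entry in the dictionary, so it becomes NaN
mapped = notice.map({"Yes": True, "No": False})

# plain bool cannot hold a missing value: NaN is truthy and turns into True
as_plain_bool = mapped.astype("bool")

# the nullable 'boolean' type keeps the gap as <NA>
as_nullable = mapped.astype("boolean")

print(as_plain_bool.tolist())
print(as_nullable)
```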

In [None]:
# renaming Notice Period.1 to Notice Period Duration
survey_data.rename(columns={'Notice Period.1':'Notice Period Duration'}, inplace=True)

# cleaning Notice Period Duration
survey_data['Notice Period Duration'] = survey_data['Notice Period Duration'].astype('str')
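# a valid duration is a whole number or a decimal; the dot is escaped so it only matches a literal '.'
survey_data.loc[~survey_data['Notice Period Duration'].str.contains(r'^[0-9]+(?:\.[0-9]+)?$', regex=True), 'Notice Period Duration'] = -1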
survey_data['Notice Period Duration'] = survey_data['Notice Period Duration'].astype('float')
survey_data.loc[survey_data['Notice Period Duration'] == -1, 'Notice Period Duration'] = np.nan
survey_data['Notice Period Duration']

#### Try to standardize the `Industry` column
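
One possible approach, offered as a sketch rather than the intended solution, is a lookup dictionary built from the `Industry` table shown earlier. `INDUSTRY_MAP` and `standardize_industry` are hypothetical names, and rows whose standard value is blank in the table are assumed to be already standard:

```python
import pandas as pd

# keys are lowercase with single spaces; values are the standard names
INDUSTRY_MAP = {
    "bfsi": "BFSI",
    "healthcare": "Healthcare",
    "biotechnology research": "Biotechnology Research",
    "education": "Education",
    "teaching": "Education",
    "pharmaceutical": "Pharmaceutical",
    "supply chain management": "Supply Chain Management",
    "naval architecture": "Naval Architecture",
    "telecom": "Telecommunications",
    "insurance domain and finance": "Insurance Domain and Finance",
    "bpo": "BPO",
    "fmcg": "FMCG",
}

def standardize_industry(column):
    """Map raw Industry strings to standard names; unknown values become 'Invalid'."""
    # lowercase, collapse repeated whitespace and trim the ends before looking up
    key = column.str.lower().str.replace(r"\s+", " ", regex=True).str.strip()
    return key.map(INDUSTRY_MAP).fillna("Invalid")

demo = pd.Series(["HealthCare", "Teaching", "telecom", "DATA SCIENCE 360", "Supply Chain management"])
result = standardize_industry(demo)
print(result.tolist())
```

Keeping the spaces in the lookup keys avoids having to re-insert them later for multi-word names.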

### Final script for cleansing on `survey_data`

In [None]:
import pandas as pd
import warnings
import numpy as np
import datetime as dt
import re

warnings.filterwarnings('ignore')

# importing data into pandas
survey_data = pd.read_csv(r"E:\data\survey_data.csv")

survey_data.fillna('NA', inplace=True)

# checking whether Gender column contains any value other than Male or Female
survey_data['Gender'].unique()

valid_gender = ['Male', 'Female', 'Other']
survey_data.loc[~survey_data['Gender'].isin(valid_gender), 'Gender'] = ''

# assigning null where invalid value in Courses Enrolled
survey_data.loc[survey_data['Courses Enrolled'].str.contains('[^0-9]+', regex=True), 'Courses Enrolled'] = -1
survey_data['Courses Enrolled'] = survey_data['Courses Enrolled'].astype('int')
survey_data.loc[survey_data['Courses Enrolled'] == -1, 'Courses Enrolled'] = np.nan

# assigning null where invalid value in Courses Completed
survey_data.loc[survey_data['Courses Completed'].str.contains('[^0-9]+', regex=True), 'Courses Completed'] = -1
survey_data['Courses Completed'] = survey_data['Courses Completed'].astype('int')
survey_data.loc[survey_data['Courses Completed'] == -1, 'Courses Completed'] = np.nan

# cleaning up Education column
# removing non-alphabetic characters and converting to lowercase to make checking easier
survey_data['Education'] = survey_data['Education'].str.lower().str.replace('[^a-z]+', '', regex=True).str.strip()

survey_data.loc[survey_data['Education'] == 'mba', 'Education'] = 'M.B.A.'
survey_data.loc[survey_data['Education'] == 'bca', 'Education'] = 'B.C.A.'
survey_data.loc[survey_data['Education'].str.contains('m[a-z]*sc[a-z]*', regex=True), 'Education'] = 'M.Sc.'
survey_data.loc[survey_data['Education'].str.contains('b[a-z]*sc[a-z]*', regex=True), 'Education'] = 'B.Sc.'
survey_data.loc[survey_data['Education'].str.contains('b[a-z]*com[a-z]*', regex=True), 'Education'] = 'B.Com.'
survey_data.loc[survey_data['Education'].str.contains('b[a-z]*a[a-z]*', regex=True), 'Education'] = 'B.A.'
survey_data.loc[survey_data['Education'].str.contains('b[a-z]*tech[a-z]*', regex=True), 'Education'] = 'B.Tech.'
survey_data.loc[survey_data['Education'].str.contains('m[a-z]*tech[a-z]*', regex=True), 'Education'] = 'M.Tech.'

# every standard value, including B.Com.
valid_education = ['M.Tech.', 'B.Tech.', 'B.A.', 'M.B.A.', 'B.C.A.', 'M.Sc.', 'B.Sc.', 'B.Com.']
survey_data.loc[~survey_data['Education'].isin(valid_education), 'Education'] = ''

# cleaning up Profession column
valid_profession = ['Student', 'Working']
survey_data.loc[~survey_data['Profession'].isin(valid_profession), 'Profession'] = ''

# cleaning up Industry column
# spaces are kept so that multi-word names such as 'supply chain management' can still match
valid_industry = ['bfsi', 'healthcare', 'research', 'education', 'pharmaceutical', 'supply chain management', 'insurance', 'telecommunications']
survey_data['Industry'] = survey_data['Industry'].str.lower().str.replace('[^a-z ]+', '', regex=True).str.strip()
survey_data.loc[~survey_data['Industry'].isin(valid_industry), 'Industry'] = ''
survey_data['Industry'] = survey_data['Industry'].str.title()

# cleaning up Date of Joining column
for row in survey_data.index:
    value = survey_data.loc[row, 'Date of Joining']

    # hyphen-separated values are read as day-month-year
    if re.fullmatch(r'[0-9]{2}-[0-9]{2}-[0-9]{4}', value):
        survey_data.loc[row, 'Date of Joining'] = dt.datetime.strptime(value, '%d-%m-%Y')

    # slash-separated values are read as month/day/year
    elif re.fullmatch(r'[0-9]{2}/[0-9]{2}/[0-9]{4}', value):
        survey_data.loc[row, 'Date of Joining'] = dt.datetime.strptime(value, '%m/%d/%Y')

    # anything else (including the 'NA' placeholder) becomes null
    else:
        survey_data.loc[row, 'Date of Joining'] = np.nan

# converting Date of Joining from string to date data type
survey_data['Date of Joining'] = pd.to_datetime(survey_data['Date of Joining'])

# when the dates in a column are in the same format, we can directly use pd.to_datetime() to convert the entire column into a standard date column
survey_data['Last Active'] = pd.to_datetime(survey_data['Last Active'], format='%d-%m-%Y', errors='coerce')

# cleaning up Notice Period column
survey_data['Notice Period'] = survey_data['Notice Period'].map({'Yes':True
                                                                ,'No':False})

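# converting type of Notice Period from object to nullable boolean
# astype() returns a new Series, so the result has to be assigned back
survey_data['Notice Period'] = survey_data['Notice Period'].astype('boolean')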

# renaming Notice Period.1 to Notice Period Duration
survey_data.rename(columns={'Notice Period.1':'Notice Period Duration'}, inplace=True)

# assigning null where invalid value in Notice Period Duration
# a valid duration is a whole number or a decimal; the dot is escaped so it only matches a literal '.'
survey_data['Notice Period Duration'] = survey_data['Notice Period Duration'].astype('str')
survey_data.loc[~survey_data['Notice Period Duration'].str.contains(r'^[0-9]+(?:\.[0-9]+)?$', regex=True), 'Notice Period Duration'] = -1
survey_data['Notice Period Duration'] = survey_data['Notice Period Duration'].astype('float')
survey_data.loc[survey_data['Notice Period Duration'] == -1, 'Notice Period Duration'] = np.nan
survey_data