Last Update: 9th October 2023

Status: No pending updates, ready to use

---

In [89]:
import pandas as pd
import numpy as np
import re

#### 1. Read CSV

<u>Note:</u>
1. To avoid file path issues, keep this "ipynb" file in the same folder as the file that you wish to read.
2. This Python script only supports CSV files, so convert the Excel file to CSV UTF-8 first.

<u>Additional Notes:</u>
1. A warning about the column data type (`DtypeWarning`) is usually caused by mixed data types in one column or by a datetime format that pandas cannot read. To fix this issue, pass additional arguments to `pd.read_csv()`.
    
    Sample Code:
    
    ```python
        df = pd.read_csv(excel_file, parse_dates= ["ColumnName"], dtype= {"ColumnName": "Dtype"})
    ```
    
    - The `parse_dates` argument helps pandas read datetime columns that it does not parse in its default configuration.
    - The `dtype` argument takes a dictionary, as shown in the sample code above, or a single data type. The data type is applied either to the whole dataset or to the specified columns. E.g., `{'a': np.float64, 'b': np.int32, 'c': 'Int64'}`.

    See the link for the full list of data types in pandas. [Full List of Data Types in Pandas](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes)
1. Remember to change all the datetime columns to "datetime64[ns]" in Section 2.
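
As a minimal sketch of note 1, the snippet below reads a small in-memory CSV with `parse_dates` and `dtype`. The column names and values are hypothetical and are not taken from the dataset used in this notebook.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real file will have different columns.
sample_csv = io.StringIO(
    "Record ID,Create Date,Currently in workflow\n"
    "1,2023-01-15,True\n"
    "2,,\n"
    "3,2023-03-02,False\n"
)

sample_df = pd.read_csv(
    sample_csv,
    parse_dates=["Create Date"],                 # read this column as datetime
    dtype={"Currently in workflow": "boolean"},  # nullable boolean keeps blanks as <NA>
)

print(sample_df.dtypes)
```

The nullable `"boolean"` dtype is used here because a plain `bool` column cannot hold blank values.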

In [90]:
#Update the file name
excel_file = "Test_Initial_Data_Exploration"

df = pd.read_csv(excel_file + ".csv")


#### Comment out the following section if there is no DtypeWarning

In [91]:
# inspect the columns named in the warning (use the column indexes from the warning message)

# replace the column indexes in the list
display(df.iloc[:,[29,33,40]].head())

# store the names of the datetime columns
date_col = df.iloc[:,[29,33]].columns.to_list()

# create data dictionary for dtype
data_dict = {"Currently in workflow" : "boolean"}

Unnamed: 0,First marketing email reply date,Last marketing email reply date,Currently in workflow
0,,,True
1,,,
2,,,True
3,,,False
4,,,False


In [92]:
# read the csv again with parse and dtype
df = pd.read_csv(excel_file + ".csv", parse_dates= date_col, dtype= data_dict)

#### Rename columns to remove symbols and blanks

In [93]:
# rename the columns so that they contain no blanks, (), /, or :
df_col = df.columns
replace_pattern = r'[ ()\/:]+'

df_col_rename = {col : re.sub(replace_pattern, "_", col) for col in df_col}
df.rename(columns= df_col_rename, inplace= True)
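
To illustrate what the pattern does, here is a small sketch that applies the same regular expression to a few column names. The sample names are assumptions chosen for illustration.

```python
import re

# Same pattern as in the cell above: one or more spaces, parentheses, slashes, or colons.
replace_pattern = r'[ ()\/:]+'

# Hypothetical column names for illustration.
sample_cols = ["Company Name (Clean)", "Country/Region", "Create Date"]
renamed = [re.sub(replace_pattern, "_", col) for col in sample_cols]
print(renamed)
```

Because of the `+` quantifier, a run of consecutive symbols such as " (" collapses into a single underscore, and a trailing symbol such as ")" leaves a trailing underscore.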

#### 2. Inspect each column and update the column type if necessary

<u>Note:</u>
1. When mixed data types are found in the same column, pandas will usually identify the column as object. Note that pandas reads a blank as NaN, which is a float, so a blank counts as a different type in a string or boolean column. The following combinations are commonly found:
    - A column with strings/boolean and blank value (NaN)
    - A column with strings/boolean and integer
2. You can pass a dictionary into `.astype()` to change the data type of multiple columns at once.
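
The sketch below shows both notes on a tiny, made-up DataFrame: a column with only blanks is read as float64, booleans mixed with a blank become object, and a dictionary passed to `.astype()` converts several columns at once. All column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
sample_df = pd.DataFrame({
    "Job_function": [np.nan, np.nan, np.nan],        # only blanks
    "Currently_in_workflow": [True, np.nan, False],  # booleans mixed with a blank
    "Create_Date": pd.Series(["2023-01-15", "2023-02-01", np.nan], dtype=object),
})
dtypes_before = sample_df.dtypes
print(dtypes_before)

# Convert several columns in one call by passing a dictionary to .astype()
sample_df = sample_df.astype({
    "Currently_in_workflow": "boolean",
    "Create_Date": "datetime64[ns]",
})
print(sample_df.dtypes)
```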

In [94]:
df.dtypes

Record_ID                                             int64
First_Name                                           object
Last_Name                                            object
Email                                                object
Job_function                                        float64
Job_Title                                            object
Industry                                             object
Industry_Segmentation                                object
Company_Name                                         object
Company_Name_Clean_                                  object
Create_Date                                          object
Original_Source                                      object
Country_Region                                       object
Last_Engagement_Date                                 object
Last_Activity_Date                                   object
Last_Contacted                                       object
Last_Modified_Date                      

In [95]:
# The following is just an example, update the dictionary according to the dataset
dt_dict = {'First_Name': 'string', 'Last_Name': 'string', 'Create_Date': 'datetime64[ns]', 'Last_Engagement_Date': 'datetime64[ns]',
            'Last_Activity_Date': 'datetime64[ns]', 'Last_Contacted': 'datetime64[ns]', 'Last_Modified_Date': 'datetime64[ns]',
            'First_marketing_email_click_date': 'datetime64[ns]', 'First_marketing_email_open_date': 'datetime64[ns]',
            'First_marketing_email_reply_date': 'datetime64[ns]', 'First_marketing_email_send_date': 'datetime64[ns]',
            'Last_marketing_email_open_date': 'datetime64[ns]', 'Last_marketing_email_reply_date': 'datetime64[ns]',
            'Last_marketing_email_send_date': 'datetime64[ns]'}
df = df.astype(dt_dict)
df.dtypes

Record_ID                                             int64
First_Name                                           string
Last_Name                                            string
Email                                                object
Job_function                                        float64
Job_Title                                            object
Industry                                             object
Industry_Segmentation                                object
Company_Name                                         object
Company_Name_Clean_                                  object
Create_Date                                  datetime64[ns]
Original_Source                                      object
Country_Region                                       object
Last_Engagement_Date                         datetime64[ns]
Last_Activity_Date                           datetime64[ns]
Last_Contacted                               datetime64[ns]
Last_Modified_Date                      

#### 3. Descriptive Statistics

<u>Note:</u>

The following are the fields for the descriptive statistics:
1. Data Type
1. Total
1. Non Missing Value
1. Missing Value
1. Missing Percentage (%)
1. Min
1. Max
1. Mean
1. Median
1. Mode
1. No of Mode
1. Number of Unique Value
1. List of Unique Value
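
As a rough sketch of how a few of these fields can be computed, the snippet below builds a small summary table from a made-up DataFrame. The data, the variable names, and the reduced set of fields are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
sample_df = pd.DataFrame({
    "Score": [10.0, 20.0, np.nan, 20.0],
    "Segment": pd.Series(["A", "B", np.nan, "B"], dtype=object),
})

total = sample_df.shape[0]                       # Total
missing = sample_df.isna().sum()                 # Missing Value
missing_pct = (missing / total * 100).round(2)   # Missing Percentage (%)
# Mode: build the lists per column so each column keeps its own list of modes
modes = {col: sample_df[col].mode().tolist() for col in sample_df.columns}
n_unique = sample_df.nunique(dropna=False)       # NaN is counted as its own value

summary = pd.DataFrame({
    "Total": total,
    "Missing Value": missing,
    "Missing Percentage (%)": missing_pct,
    "Mode": pd.Series(modes),
    "Number of Unique Value": n_unique,
})
print(summary)
```

With `dropna=False`, a column that contains blanks reports one more unique value than the number of distinct non-missing entries.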

In [96]:
def _supports_stats(x):
    # True for numeric columns and for any datetime column;
    # is_datetime64_any_dtype already covers timezone-naive and timezone-aware datetimes
    return pd.api.types.is_numeric_dtype(x) or pd.api.types.is_datetime64_any_dtype(x)

def min_numeric(x):
    # minimum for numeric/datetime columns, empty string otherwise
    return x.min() if _supports_stats(x) else ""

def max_numeric(x):
    # maximum for numeric/datetime columns, empty string otherwise
    return x.max() if _supports_stats(x) else ""

def mean_numeric(x):
    # mean for numeric/datetime columns, empty string otherwise
    return x.mean() if _supports_stats(x) else ""

def median_numeric(x):
    # median for numeric/datetime columns, empty string otherwise
    return x.median() if _supports_stats(x) else ""

In [97]:
df_types = df.dtypes
df_total = df.shape[0]
df_count = df.count()
df_NaN = df.isna().sum()
df_miss_percent = round(df_NaN / df_total * 100,2)

In [98]:
df_min = df.apply(min_numeric)
df_max = df.apply(max_numeric)
df_mean = df.apply(mean_numeric)
df_median = df.apply(median_numeric)
df_mode = df.apply(lambda x: x.mode().tolist())
df_mode_len = df.apply(lambda x: len(x.mode().tolist()))
df_nunique = df.nunique(dropna= False)

In [99]:
df_unique_list = []
for col in df.columns:
    df_unique_list.append(df[col].unique())

In [100]:
dict_des = {"Data Type": df_types, "Total": df_total, "Non Missing Value": df_count, "Missing Value": df_NaN, "Missing Percentage (%)": df_miss_percent,
            "Min": df_min, "Max": df_max, "Mean": df_mean, "Median": df_median, "Mode": df_mode, "No of Mode": df_mode_len,
            "Number of Unique Value": df_nunique, "Unique List": df_unique_list}

df_des = pd.DataFrame(dict_des, index= df.columns)
df_des

Unnamed: 0,Data Type,Total,Non Missing Value,Missing Value,Missing Percentage (%),Min,Max,Mean,Median,Mode,No of Mode,Number of Unique Value,Unique List
Record_ID,int64,327194,327194,0,0.0,4351,442277751,234711079.593935,254136301.0,"[4351, 144501, 147352, 187051, 194601, 453201,...",327194,327194,"[442277751, 442229801, 442224051, 442210651, 4..."
First_Name,string,327194,320772,6422,1.96,,,,,[Jennifer],1,32217,"[Alicia, Darryl, Donna, Anand, <NA>, Tanner, A..."
Last_Name,string,327194,321384,5810,1.78,,,,,[Smith],1,101720,"[Carlos Olleta, English, Holder, K, <NA>, Jone..."
Email,object,327194,327194,0,0.0,,,,,"[004424@thomas.org.br, 0071950@student.seark.e...",327194,327194,"[alicia.carlos@wamos.com, darryl@dallasdcs.com..."
Job_function,float64,327194,0,327194,100.0,,,,,[],0,1,[nan]
Job_Title,object,327194,270446,56748,17.34,,,,,[Director of Human Resources],1,59250,"[nan, Recruiter, Director, Delivery Digital Te..."
Industry,object,327194,161550,165644,50.63,,,,,[ -],1,537,"[nan, Higher Education, Machinery, Human Resou..."
Industry_Segmentation,object,327194,120515,206679,63.17,,,,,[Others],1,9,"[nan, Higher Education, Others, Staffing & Rec..."
Company_Name,object,327194,320509,6685,2.04,,,,,[.],1,166057,"[Wamos, Dcs, Columbia University, Fieldassist,..."
Company_Name_Clean_,object,327194,318451,8743,2.67,,,,,[Keller Williams Realty],1,165816,"[Wamos, Dcs, Columbia University, Fieldassist,..."


#### 4. Save Excel

In [101]:
# update the filename
new_File = excel_file + "_Output"

# the context manager saves and closes the workbook automatically
with pd.ExcelWriter(new_File + ".xlsx") as writer:
    df_des.to_excel(writer)