## Test: Customers' sex profiling derived from their first name

This notebook explores how to infer customers' sex from their first name, using the sex most commonly associated with each name. The approach proved successful and has been incorporated into the TablePrep notebook (the "Table_Preparation" file).

In [2]:
# relevant module

import pandas as pd


### Step 1 - Collecting name data from US

Note that the customers are from the US, so name data from that country is preferred. Paid services exist for this kind of profiling, but public name data is a relatively easy way to avoid them.

In [3]:

# baby name data for multiple years is obtained from US government data:
# https://www.ssa.gov/oact/babynames/limits.html
# (note that using ChatGPT for this did not give correct results: the responses for the initial names
# were accurate, but as it progressed it simply assigned a single sex to all remaining names)

# list of file paths, one file per birth year from 1997 to 2023
file_paths = [f'Names/yob{year}.txt' for year in range(1997, 2024)]

# list to store DataFrames
dfs = []

# read each file into a DataFrame and append to the list
for file_path in file_paths:
    df = pd.read_csv(file_path, header=None)
    dfs.append(df)

# concat to join all DataFrames and expand the names across the list
result_df = pd.concat(dfs, axis=0, ignore_index=True)

# "master" list of names
print(result_df)


              0  1      2
0         Emily  F  25735
1       Jessica  F  21045
2        Ashley  F  20896
3         Sarah  F  20715
4        Hannah  F  20595
...         ... ..    ...
867561    Zyell  M      5
867562     Zyen  M      5
867563   Zymirr  M      5
867564   Zyquan  M      5
867565    Zyrin  M      5

[867566 rows x 3 columns]


In [4]:
# renaming columns' labels 
result_df.rename(columns={0:'Name', 1:'Sex',2:'Popularity'}, inplace=True)
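
The loading and renaming steps above could also be combined so that the columns are named at read time and each row remembers which yearly file it came from. The following is only a minimal sketch: the helper `load_name_files` and the extra `Year` column are hypothetical additions, and the in-memory files contain made-up counts so the example runs without the `Names/` folder.

```python
import io
import pandas as pd

def load_name_files(sources):
    """Read SSA-style name files (one 'Name,Sex,Count' row per line, no header).

    `sources` maps a birth year to a file path or file-like object.
    """
    frames = []
    for year, source in sources.items():
        frame = pd.read_csv(source, header=None, names=['Name', 'Sex', 'Popularity'])
        frame['Year'] = year  # hypothetical extra column: the year the row came from
        frames.append(frame)
    return pd.concat(frames, axis=0, ignore_index=True)

# Tiny in-memory stand-ins for two yearly files (made-up counts)
demo_sources = {
    1997: io.StringIO("Emily,F,100\nJordan,M,60\nJordan,F,40\n"),
    1998: io.StringIO("Emily,F,90\nJordan,M,50\nJordan,F,70\n"),
}
names_df = load_name_files(demo_sources)
print(names_df)
```

With the real files, the mapping could be built as `{year: f'Names/yob{year}.txt' for year in range(1997, 2024)}`.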

### Step 2 - Collecting data from customers

In [5]:
# reading the customers file whose names will be matched against the name/sex data
customers_df = pd.read_csv('Customers.csv', sep=',')

### Step 3 - Cleaning the data for matching 

In [6]:
# extracting the first name: the part of CustomerName before the first space or dash

customers_df['First_Name'] = customers_df['CustomerName'].str.split(r'\s|-', n=1).str[0]

print(customers_df)


    CustomerID       CustomerName      Segment First_Name
0     CG-12520        Claire Gute     Consumer     Claire
1     DV-13045    Darrin Van Huff    Corporate     Darrin
2     SO-20335     Sean O'Donnell     Consumer       Sean
3     BH-11710    Brosina Hoffman     Consumer    Brosina
4     AA-10480       Andrew Allen     Consumer     Andrew
..         ...                ...          ...        ...
788   CJ-11875       Carl Jackson    Corporate       Carl
789   RS-19870         Roy Skaria  Home Office        Roy
790   SC-20845         Sung Chung     Consumer       Sung
791   RE-19405    Ricardo Emerson     Consumer    Ricardo
792   SM-20905  Susan MacKendrick     Consumer      Susan

[793 rows x 4 columns]
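
Because the later match on `First_Name` is an exact, case-sensitive comparison, stray whitespace or unusual capitalization in a customer name can prevent a match. The sketch below shows one possible normalization. The helper `extract_first_name` is hypothetical, the sample names are made up, and it assumes the reference name list stores names with only the first letter capitalized.

```python
import pandas as pd

def extract_first_name(full_names: pd.Series) -> pd.Series:
    """Return the text before the first whitespace or dash, capitalized."""
    # Capture the leading run of characters that are neither whitespace nor a dash
    first = full_names.str.extract(r'^\s*([^\s-]+)', expand=False)
    # 'MaryBeth' -> 'Marybeth', 'sean' -> 'Sean'
    return first.str.capitalize()

# Made-up sample names for illustration
demo_names = pd.Series(['Claire Gute', 'MaryBeth Example', 'Anne-Marie Smith', '  sean Example'])
print(extract_first_name(demo_names).tolist())
```

Whether capitalizing helps depends on how the reference data spells compound names, so the match rate is worth comparing before and after.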


### Step 4 - Conducting the customer-common name match

In [7]:
# each customer's first name is assigned the sex most commonly associated with it

# Define a function to choose the sex for each name group: result_df has one row per
# name, sex and year, so this returns the sex of the row with the highest single-year count
def choose_sex(group):
    max_popularity_row = group.loc[group['Popularity'].idxmax()]
    return max_popularity_row['Sex']

# Group result_df by 'Name' and apply choose_sex function to each group
max_popularity_df = result_df.groupby('Name').apply(choose_sex).reset_index()
max_popularity_df.columns = ['First_Name', 'Sex']

# Merge customers_df with max_popularity_df on 'First_Name' to assign sex based on popularity
merged_df = pd.merge(customers_df, max_popularity_df, on='First_Name', how='left')

# Display the merged DataFrame
print(merged_df)


    CustomerID       CustomerName      Segment First_Name  Sex
0     CG-12520        Claire Gute     Consumer     Claire    F
1     DV-13045    Darrin Van Huff    Corporate     Darrin    M
2     SO-20335     Sean O'Donnell     Consumer       Sean    M
3     BH-11710    Brosina Hoffman     Consumer    Brosina  NaN
4     AA-10480       Andrew Allen     Consumer     Andrew    M
..         ...                ...          ...        ...  ...
788   CJ-11875       Carl Jackson    Corporate       Carl    M
789   RS-19870         Roy Skaria  Home Office        Roy    M
790   SC-20845         Sung Chung     Consumer       Sung    M
791   RE-19405    Ricardo Emerson     Consumer    Ricardo    M
792   SM-20905  Susan MacKendrick     Consumer      Susan    F

[793 rows x 5 columns]
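
Since `result_df` concatenates several yearly files, `choose_sex` picks the sex of the single highest yearly count for each name. A possible alternative, sketched below with made-up counts, is to sum the counts per name and sex across all years first and then keep the dominant sex. It also computes the share of that sex as a rough confidence measure. The names `totals`, `Share` and `dominant` are hypothetical, and the result may differ from the notebook's for names used for both sexes.

```python
import pandas as pd

# Made-up counts standing in for the concatenated yearly files
names_df = pd.DataFrame({
    'Name':       ['Jordan', 'Jordan', 'Jordan', 'Jordan', 'Emily', 'Emily'],
    'Sex':        ['M', 'F', 'M', 'F', 'F', 'F'],
    'Popularity': [60, 40, 50, 65, 100, 90],
})

# Total count per (name, sex) over all years
totals = names_df.groupby(['Name', 'Sex'], as_index=False)['Popularity'].sum()

# Share of each sex within a name, usable as a rough confidence measure
totals['Share'] = totals['Popularity'] / totals.groupby('Name')['Popularity'].transform('sum')

# Keep the row with the highest total per name
dominant = (
    totals.sort_values('Popularity', ascending=False)
          .drop_duplicates('Name')
          .rename(columns={'Name': 'First_Name'})
          [['First_Name', 'Sex', 'Share']]
          .reset_index(drop=True)
)
print(dominant)
```

In this toy data the single highest row for Jordan is female (65), while the totals favor male (110 against 105), which illustrates how the two rules can disagree.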


In [8]:
# checking whether any names were left without a sex (i.e., a lower non-null count in the Sex column)
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 793 entries, 0 to 792
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   CustomerID    793 non-null    object
 1   CustomerName  793 non-null    object
 2   Segment       793 non-null    object
 3   First_Name    793 non-null    object
 4   Sex           763 non-null    object
dtypes: object(5)
memory usage: 31.1+ KB


### Step 5 - Dealing with missing names

Since only 30 of the 793 customers are unmatched, these names can be handled either manually or with ChatGPT (in this case its results are accurate).
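
To see how many distinct names actually need a manual decision, the merge itself can report which rows found no match. This is a small sketch on made-up frames, using the `indicator` option of `merge`. The variable names `checked`, `unmatched_names` and `match_rate` are illustrative only.

```python
import pandas as pd

# Made-up stand-ins for customers_df and max_popularity_df
customers = pd.DataFrame({'First_Name': ['Claire', 'Brosina', 'Zuschuss', 'Zuschuss']})
lookup = pd.DataFrame({'First_Name': ['Claire'], 'Sex': ['F']})

# indicator=True adds a '_merge' column: 'both' for matched rows, 'left_only' for unmatched ones
checked = customers.merge(lookup, on='First_Name', how='left', indicator=True)

unmatched_names = sorted(checked.loc[checked['_merge'] == 'left_only', 'First_Name'].unique())
match_rate = (checked['_merge'] == 'both').mean()

print(unmatched_names)
print(match_rate)
```

Listing distinct names rather than rows avoids deciding the same first name twice when several customers share it.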

In [9]:
# obtaining the names that did not have any match  
merged_df[merged_df.isna().any(axis=1)].to_dict()

{'CustomerID': {3: 'BH-11710',
  9: 'ZD-21925',
  27: 'KM-16720',
  43: 'PN-18775',
  114: 'Dl-13600',
  164: 'SS-20140',
  178: 'PO-19180',
  190: 'XP-21865',
  220: 'OT-18730',
  233: 'ZC-21910',
  237: 'CK-12205',
  242: 'CK-12595',
  261: 'RP-19390',
  268: 'NP-18325',
  285: 'PO-19195',
  313: 'EM-14095',
  316: 'CK-12760',
  318: 'BK-11260',
  337: 'SC-20050',
  366: 'KM-16225',
  368: 'DE-13255',
  415: 'JR-15700',
  460: 'DK-12835',
  461: 'ST-20530',
  521: 'HZ-14950',
  524: 'FM-14215',
  611: 'SG-20605',
  652: 'MS-17530',
  653: 'RH-19555',
  721: 'LS-17230'},
 'CustomerName': {3: 'Brosina Hoffman',
  9: 'Zuschuss Donatelli',
  27: 'Kunst Miller',
  43: 'Parhena Norris',
  114: 'Dorris liebe',
  164: 'Saphhira Shifley',
  178: 'Philisse Overcash',
  190: 'Xylona Preis',
  220: 'Olvera Toch',
  233: 'Zuschuss Carroll',
  237: 'Chloris Kastensmidt',
  242: 'Clytie Kelty',
  261: 'Resi Pölking',
  268: 'Naresj Patel',
  285: 'Phillina Ober',
  313: 'Eudokia Martin',
  316: 'Cy

In [11]:
# filling in the sex for the 30 unmatched customers (29 distinct first names)

# Commonly associated sex, based on ChatGPT's input
common_sex = {
    'Brosina': 'F',
    'Zuschuss': 'M',
    'Kunst': 'M',
    'Parhena': 'F',
    'Dorris': 'F',
    'Saphhira': 'F',
    'Philisse': 'F',
    'Xylona': 'F',
    'Olvera': 'M',
    'Chloris': 'F',
    'Clytie': 'F',
    'Resi': 'F',
    'Naresj': 'M',
    'Phillina': 'F',
    'Eudokia': 'F',
    'Cyma': 'F',
    'Berenike': 'F',
    'Sample': 'M', # Assuming Sample as male
    'Kalyca': 'F',
    'Deanra': 'F',
    'Jocasta': 'F',
    'Damala': 'F',
    'Shui': 'M',
    'Henia': 'F',
    'Filia': 'F',
    'Speros': 'M',
    'MaryBeth': 'F',
    'Ritsa': 'F',
    'Lycoris': 'F'
}

# Assign sex based on assumptions only for rows where 'Sex' column is NaN
merged_df.loc[merged_df['Sex'].isnull(), 'Sex'] = merged_df.loc[merged_df['Sex'].isnull(), 'First_Name'].map(common_sex)

# Display the DataFrame with assigned sex
print(merged_df)

    CustomerID       CustomerName      Segment First_Name Sex
0     CG-12520        Claire Gute     Consumer     Claire   F
1     DV-13045    Darrin Van Huff    Corporate     Darrin   M
2     SO-20335     Sean O'Donnell     Consumer       Sean   M
3     BH-11710    Brosina Hoffman     Consumer    Brosina   F
4     AA-10480       Andrew Allen     Consumer     Andrew   M
..         ...                ...          ...        ...  ..
788   CJ-11875       Carl Jackson    Corporate       Carl   M
789   RS-19870         Roy Skaria  Home Office        Roy   M
790   SC-20845         Sung Chung     Consumer       Sung   M
791   RE-19405    Ricardo Emerson     Consumer    Ricardo   M
792   SM-20905  Susan MacKendrick     Consumer      Susan   F

[793 rows x 5 columns]
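
Before relying on a hand-written dictionary, it may help to confirm that it covers every unmatched first name, since a misspelled key would silently leave a `NaN` behind. The sketch below uses made-up data and a deliberately incomplete mapping. `manual_sex` and `not_covered` are hypothetical names, and `fillna` is shown as one equivalent way to fill only the missing rows.

```python
import pandas as pd

# Made-up stand-in for merged_df, with two rows still missing a sex
df = pd.DataFrame({
    'First_Name': ['Claire', 'Brosina', 'Kunst'],
    'Sex':        ['F', None, None],
})
manual_sex = {'Brosina': 'F'}  # deliberately incomplete for illustration

# First names that still need a value but are absent from the manual mapping
needs_fill = df['Sex'].isna()
not_covered = sorted(set(df.loc[needs_fill, 'First_Name']) - set(manual_sex))

# Fill only the missing values; rows that already have a sex are left untouched
df['Sex'] = df['Sex'].fillna(df['First_Name'].map(manual_sex))

print(not_covered)
print(df)
```

An empty `not_covered` list would indicate that every missing name has an entry in the mapping.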


In [15]:
# reviewing if there are any NaN values

merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 793 entries, 0 to 792
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   CustomerID    793 non-null    object
 1   CustomerName  793 non-null    object
 2   Segment       793 non-null    object
 3   First_Name    793 non-null    object
 4   Sex           793 non-null    object
dtypes: object(5)
memory usage: 31.1+ KB


In [14]:
# spot-checking one sample name to confirm its previously missing sex is now filled:

merged_df[merged_df['First_Name']== 'Jocasta']

    CustomerID    CustomerName   Segment First_Name Sex
415   JR-15700  Jocasta Rupert  Consumer    Jocasta   F
