# Data wrangling - Exploratory Data Analysis
### Most in-demand data skills
This notebook wrangles the raw data produced by the webscraper.
The aim is to format that data into a single csv file for further processing.

In [1]:
import numpy as np
import pandas as pd
import re  
import os 
import popular_data_skills.config.config as config



We will find all the files saved in the scraped data folder and read them into a single data frame.

In [2]:
# Get all files in the scraped data folder
file_list = os.listdir(config.SCRAPED_DATA_FOLDER)

# Read every csv and concat them into one large data frame.
# os.path.join builds the path portably instead of hard-coding '\\'.
frames = [pd.read_csv(os.path.join(config.SCRAPED_DATA_FOLDER, file)) for file in file_list]
df = pd.concat(frames, ignore_index=True)
df.head(3)

Unnamed: 0.1,Unnamed: 0,job_title,company_name,location,work_type,date_posted,applicant_count,level,company_info,job_description_lines,country,job
0,0,"('DIGITAL DATA ANALYST - REMOTE',)","('Harnham',)","('London, England, United Kingdom',)","('Remote',)","('5 days ago',)","('25 applicants',)","('1 school alumni',)",('See how you compare to 25 applicants. Retry ...,"\nDIGITAL DATA ANALYST\n\nREMOTE\n\n£50,000 - ...",uk,analyst
1,1,"('DIGITAL DATA ANALYST - REMOTE',)","('Harnham',)","('London, England, United Kingdom',)","('Remote',)","('5 days ago',)","('25 applicants',)","('1 school alumni',)",('See how you compare to 25 applicants. Retry ...,"\nDIGITAL DATA ANALYST\n\nREMOTE\n\n£50,000 - ...",uk,analyst
2,2,"('Work from Home Opportunity | Data Analyst',)","('TELUS International AI Data Solutions',)","('London Area, United Kingdom',)","('Remote',)","('3 weeks ago',)",,"('',)","('Actively recruiting',)",\nTELUS International AI-Data Solutions partne...,uk,analyst


First we can drop the unnamed column.   
Then we will check to see some basic info about the df. 

In [3]:
df.drop(['Unnamed: 0'], axis=1, inplace=True)

In [4]:
# Look at dtypes and NaN counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3723 entries, 0 to 3722
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   job_title              3723 non-null   object
 1   company_name           3723 non-null   object
 2   location               3723 non-null   object
 3   work_type              3723 non-null   object
 4   date_posted            3723 non-null   object
 5   applicant_count        2738 non-null   object
 6   level                  3723 non-null   object
 7   company_info           3723 non-null   object
 8   job_description_lines  3711 non-null   object
 9   country                3723 non-null   object
 10  job                    3723 non-null   object
dtypes: object(11)
memory usage: 320.1+ KB


In [5]:
df.describe(include='all')

Unnamed: 0,job_title,company_name,location,work_type,date_posted,applicant_count,level,company_info,job_description_lines,country,job
count,3723,3723,3723,3723,3723,2738,3723,3723,3711,3723,3723
unique,1422,659,450,1,54,183,490,252,1622,2,2
top,"('Data Analyst',)","('Varsity Tutors, a Nerdy Company',)","('United States',)","('Remote',)","('1 week ago',)","('1 applicant',)","('',)","('',)",\nCurrent Employees\n\nIf you are a current em...,usa,scientist
freq,294,514,817,3723,796,179,478,2071,290,2723,2474


Some columns can be removed because they carry no useful data:
- work_type is 'Remote' in every entry
- level and company_info don't contain any useful data

In [6]:
# Drop columns with no useful data
df.drop(['work_type', 'level','company_info'], axis=1, inplace=True)
df.head()

Unnamed: 0,job_title,company_name,location,date_posted,applicant_count,job_description_lines,country,job
0,"('DIGITAL DATA ANALYST - REMOTE',)","('Harnham',)","('London, England, United Kingdom',)","('5 days ago',)","('25 applicants',)","\nDIGITAL DATA ANALYST\n\nREMOTE\n\n£50,000 - ...",uk,analyst
1,"('DIGITAL DATA ANALYST - REMOTE',)","('Harnham',)","('London, England, United Kingdom',)","('5 days ago',)","('25 applicants',)","\nDIGITAL DATA ANALYST\n\nREMOTE\n\n£50,000 - ...",uk,analyst
2,"('Work from Home Opportunity | Data Analyst',)","('TELUS International AI Data Solutions',)","('London Area, United Kingdom',)","('3 weeks ago',)",,\nTELUS International AI-Data Solutions partne...,uk,analyst
3,"('Online Data Analyst',)","('TELUS International AI Data Solutions',)","('Greater Cheshire West and Chester Area',)","('1 week ago',)","('85 applicants',)",\nTELUS International AI-Data Solutions partne...,uk,analyst
4,"('Online Data Analyst',)","('TELUS International AI Data Solutions',)","('Wolverhampton, England, United Kingdom',)","('1 week ago',)","('57 applicants',)",\nTELUS International AI-Data Solutions partne...,uk,analyst


Next steps:
- All values are wrapped in tuples, so unwrap them
- Change the dtype of applicant_count to a numeric type
- Look at duplicate entries from spamming companies (companies that send the same job post to multiple locations)
- There are many NaNs in the applicant_count column, but we won't need that right now
- Have a look at the job_description_lines column to check for NaNs

In [7]:
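# Strip the tuple wrapper characters ( ' , ) and | from every value.
# Note: this also removes those characters inside the text itself, e.g. '£50,000' becomes '£50000'.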
columns = list(df.columns)
for column in columns:
    df[column] = df[column].apply(lambda x: re.sub(r"[('|',)]", '', str(x)))
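
The character-class substitution above removes parentheses, quotes, commas and pipes anywhere in a value, which is why commas disappear from locations and salaries in the later outputs. If that punctuation matters, one alternative is to parse the tuple-like strings instead of stripping characters. The following is only a minimal sketch, assuming the wrapped values are single-element tuple strings such as `"('Harnham',)"`; the helper name `unwrap_tuple` is hypothetical and not part of this notebook.

```python
import ast


def unwrap_tuple(value):
    """Return the first element of a tuple-like string, else the value unchanged."""
    if not isinstance(value, str):
        return value  # e.g. NaN stays NaN
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # plain text such as a job description or 'uk'
    if isinstance(parsed, tuple) and parsed:
        return parsed[0]
    return value


print(unwrap_tuple("('London, England, United Kingdom',)"))
```

It could be applied per column with `df[column].apply(unwrap_tuple)`, leaving the free-text descriptions untouched.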

In [8]:
# Remove 'applicants' string from applicant_count column
df = df.astype(str)
df['applicant_count'] = df['applicant_count'].apply(lambda x : re.sub(r" applicants", '', str(x)))

# Change strings to floats; any value that is not purely numeric becomes NaN
df['applicant_count'] = pd.to_numeric(df['applicant_count'], errors='coerce')    
df.head()

Unnamed: 0,job_title,company_name,location,date_posted,applicant_count,job_description_lines,country,job
0,DIGITAL DATA ANALYST - REMOTE,Harnham,London England United Kingdom,5 days ago,25.0,\nDIGITAL DATA ANALYST\n\nREMOTE\n\n£50000 - £...,uk,analyst
1,DIGITAL DATA ANALYST - REMOTE,Harnham,London England United Kingdom,5 days ago,25.0,\nDIGITAL DATA ANALYST\n\nREMOTE\n\n£50000 - £...,uk,analyst
2,Work from Home Opportunity Data Analyst,TELUS International AI Data Solutions,London Area United Kingdom,3 weeks ago,,\nTELUS International AI-Data Solutions partne...,uk,analyst
3,Online Data Analyst,TELUS International AI Data Solutions,Greater Cheshire West and Chester Area,1 week ago,85.0,\nTELUS International AI-Data Solutions partne...,uk,analyst
4,Online Data Analyst,TELUS International AI Data Solutions,Wolverhampton England United Kingdom,1 week ago,57.0,\nTELUS International AI-Data Solutions partne...,uk,analyst


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3723 entries, 0 to 3722
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   job_title              3723 non-null   object 
 1   company_name           3723 non-null   object 
 2   location               3723 non-null   object 
 3   date_posted            3723 non-null   object 
 4   applicant_count        2559 non-null   float64
 5   job_description_lines  3723 non-null   object 
 6   country                3723 non-null   object 
 7   job                    3723 non-null   object 
dtypes: float64(1), object(7)
memory usage: 232.8+ KB
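
Note that the non-null count for applicant_count has fallen from 2738 to 2559. The difference, 179, equals the frequency of `('1 applicant',)` in the earlier describe output: the pattern `" applicants"` does not match the singular form, so those rows are coerced to NaN. A small sketch of a pattern with an optional "s", run on made-up sample values rather than the scraped data:

```python
import pandas as pd

# Made-up sample values mimicking the cleaned applicant_count strings
sample = pd.Series(["25 applicants", "1 applicant", "nan"])

counts = pd.to_numeric(
    sample.str.replace(r" applicants?", "", regex=True),  # 's' is optional
    errors="coerce",
)
print(counts)
```

Using this pattern in the notebook would keep the single-applicant rows, so the counts and summary statistics shown further down would change accordingly.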


Next we can deal with duplicate posts that might skew the results. Duplicates can arise in three ways:
- Companies reposting the same job after a while.
- Spamming companies posting the same job to various locations for better search rankings.
- We scraped multiple, very similar search queries, so there is a good chance they picked up the same results.


In [10]:
# Convert all the strings to lower case to improve searches 
columns_lower = df.columns != 'applicant_count'
df.loc[:,columns_lower] = df.loc[:,columns_lower].applymap(lambda x: x.lower())


# Delete reposted jobs: entries with the same title and the same company name
df.drop_duplicates(subset=['job_title', 'company_name'], inplace=True,ignore_index=True)

df.tail()

Unnamed: 0,job_title,company_name,location,date_posted,applicant_count,job_description_lines,country,job
1530,machine learning engineer,fiscalnote,washington dc,1 week ago,7.0,\nabout this position\n\nat fiscalnote we buil...,usa,scientist
1531,software engineer,yahoo,united states,2 weeks ago,24.0,\n it takes powerful technology t...,usa,scientist
1532,java full-stack developer__md,dice,united states,22 hours ago,8.0,\n dice is the leading career des...,usa,scientist
1533,full stack javascript engineer,dice,united states,1 month ago,15.0,\n dice is the leading career des...,usa,scientist
1534,.netcore sitefinity developers,dice,united states,3 hours ago,,\n dice is the leading career des...,usa,scientist


In [11]:
# Check for duplicated job descriptions
total_job_des = len(df['job_description_lines'])
unique_job_des = len(df['job_description_lines'].unique())
print(f'Duplicated job descriptions: {total_job_des - unique_job_des}')



Duplicated job descriptions: 17


In [12]:
# Drop any duplicated job descriptions
df.drop_duplicates('job_description_lines', inplace=True, ignore_index=True)
df.tail()

Unnamed: 0,job_title,company_name,location,date_posted,applicant_count,job_description_lines,country,job
1513,machine learning engineer,fiscalnote,washington dc,1 week ago,7.0,\nabout this position\n\nat fiscalnote we buil...,usa,scientist
1514,software engineer,yahoo,united states,2 weeks ago,24.0,\n it takes powerful technology t...,usa,scientist
1515,java full-stack developer__md,dice,united states,22 hours ago,8.0,\n dice is the leading career des...,usa,scientist
1516,full stack javascript engineer,dice,united states,1 month ago,15.0,\n dice is the leading career des...,usa,scientist
1517,.netcore sitefinity developers,dice,united states,3 hours ago,,\n dice is the leading career des...,usa,scientist


Next we will look at the number of posts per company to look for possible spam posts. 

In [13]:
df['company_name'].value_counts()

varsity tutors a nerdy company    411
dice                              130
remoteworker uk                    74
jobs via efinancialcareers         10
frontiers                          10
                                 ... 
staffgroup uk & europe              1
climax studios                      1
adaptavist                          1
xero                                1
yahoo                               1
Name: company_name, Length: 656, dtype: int64

Varsity Tutors has a far larger number of posts than any other company. "Tutors" in the name also raises some suspicion, so it deserves a closer look.

In [14]:
# View individual descriptions to look for similarities
spam_comp_df = df[df['company_name'] == 'varsity tutors a nerdy company']

# These seem to be different postings from an agency, so check the job titles
# and manually scroll through some of the descriptions.
# Rows sharing a title and company name were already dropped in cell [10],
# so no duplicated titles remain within a single company.
total_job_titles = len(spam_comp_df['job_title'])
unique_job_titles = len(spam_comp_df['job_title'].unique())
print(f'Duplicated job titles: {total_job_titles - unique_job_titles}')


Duplicated job titles: 0


In [15]:
# The descriptions are long and difficult to read in a dataframe, so we print individual ones to check manually for signs of duplication
# Change the row integer to view different descriptions
print(f"Description 1 :\n {spam_comp_df.iloc[1]['job_description_lines']}\n\n")
print(f"Description 2 :\n {spam_comp_df.iloc[2]['job_description_lines']}")


Description 1 :
 
virginia beach data analysis tutor jobs

the varsity tutors platform has thousands of students looking for online data analysis tutors nationally and in virginia beach. as a tutor who uses the varsity tutors platform you can earn good money choose your own hours and truly make a difference in the lives of your students.

why join our platform?
enjoy competitive rates and get paid 2x per week.choose to tutor as much or as little as you want.set your own hours and schedule.get paired with students best-suited to your teaching style and preferences from thousands of potential clients.tutor online i.e. “work remotely” using our purpose-built live learning platform.students can take adaptive assessments through the platform and share results to help you decide where to focus.we collect payment from the customers so all you have to do is invoice the session.
what we look for in a tutor
you have excellent communication skills and a friendly approachable personality.you can s

Looking at the Varsity Tutors posts, many are very similar and differ only in location. This looks like spamming, even though the descriptions are not exact duplicates (they survived the duplicate check above). There are also a lot of tutoring jobs that don't reflect the roles we are looking for, and the posts show no applicants.

Because these descriptions are extremely similar, with only key words changed, we will remove this company from our data set.
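
The manual reading above could be backed up with a rough similarity score. This is only an illustrative sketch using the standard library's `difflib.SequenceMatcher`; the `similarity` helper and the second location ("san diego") are made up for the example, and the texts are shortened excerpts rather than full descriptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


template = ("the varsity tutors platform has thousands of students looking "
            "for online data analysis tutors nationally and in {}.")
post_a = template.format("virginia beach")
post_b = template.format("san diego")  # hypothetical second location
other = ("dice is the leading career destination for tech experts "
         "at every stage of their careers.")

print(f"same template:     {similarity(post_a, post_b):.2f}")
print(f"different company: {similarity(post_a, other):.2f}")
```

A ratio close to 1 between two posts from the same company suggests a reused template, whereas unrelated postings score much lower.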

In [16]:
# Remove entries from Varsity Tutors
df = df[df['company_name'] != 'varsity tutors a nerdy company']
df.head()

Unnamed: 0,job_title,company_name,location,date_posted,applicant_count,job_description_lines,country,job
0,digital data analyst - remote,harnham,london england united kingdom,5 days ago,25.0,\ndigital data analyst\n\nremote\n\n£50000 - £...,uk,analyst
1,work from home opportunity data analyst,telus international ai data solutions,london area united kingdom,3 weeks ago,,\ntelus international ai-data solutions partne...,uk,analyst
2,online data analyst,telus international ai data solutions,greater cheshire west and chester area,1 week ago,85.0,\ntelus international ai-data solutions partne...,uk,analyst
3,data analyst,ovo,united kingdom,3 weeks ago,18.0,\nlocation - flexible\n\nwe’re making zero car...,uk,analyst
4,part-time online data analyst,telus international ai data solutions,hampshire england united kingdom,3 weeks ago,136.0,\ntelus international ai-data solutions partne...,uk,analyst


In [17]:
# view individual descriptions to look for similarities 
spam_comp_df = df[df['company_name'] == 'dice']
print(f"Description 1 :\n {spam_comp_df.iloc[1]['job_description_lines']}\n\n")
print(f"Description 2 :\n {spam_comp_df.iloc[2]['job_description_lines']}")

Description 1 :
 
              dice is the leading career destination for tech experts at every stage of their careers. our client ipivot llc is seeking the following. apply via dice today!

greetings

hope you are doing well.

we have an urgent requirement with our direct clients please go through the job details and let us know if you are interested do send us your updated resume and contact details

 data analyst with bsa

location: remote

contract length: 24+ months

rates are $ open

 location at princeton  nj 

"strong in:

data analyst with bsa
 provided by dice


 


Description 2 :
 
              dice is the leading career destination for tech experts at every stage of their careers. our client margin5 solutions inc is seeking the following. apply via dice today!

title- data analyst

location- remote

required:
bachelor’s degree required in computer science statistics or equivalent work experience1-3 years working experience using microsoft reporting tools and technologies

The other companies have clearly different job descriptions and look like genuine companies.


## Removing unwanted job roles and updating roles  
Many of the scraped roles, such as software engineer, are not ones we want. Data analyst and data scientist are also similar fields, so both roles show up in either search. To solve this we will search the job titles for key words and use them to define the role.

In [18]:
# Create a filter df with keywords in the title and a df with all other entries. 
analyst_words = ['analytics', 'analyst', 'analysis' ]
matches_regex = "|".join(analyst_words)
mask = df['job_title'].str.contains(matches_regex, regex=True)
analyst_df = df[mask].copy()
not_analyst_df = df[~mask]
analyst_df.head()

Unnamed: 0,job_title,company_name,location,date_posted,applicant_count,job_description_lines,country,job
0,digital data analyst - remote,harnham,london england united kingdom,5 days ago,25.0,\ndigital data analyst\n\nremote\n\n£50000 - £...,uk,analyst
1,work from home opportunity data analyst,telus international ai data solutions,london area united kingdom,3 weeks ago,,\ntelus international ai-data solutions partne...,uk,analyst
2,online data analyst,telus international ai data solutions,greater cheshire west and chester area,1 week ago,85.0,\ntelus international ai-data solutions partne...,uk,analyst
3,data analyst,ovo,united kingdom,3 weeks ago,18.0,\nlocation - flexible\n\nwe’re making zero car...,uk,analyst
4,part-time online data analyst,telus international ai data solutions,hampshire england united kingdom,3 weeks ago,136.0,\ntelus international ai-data solutions partne...,uk,analyst


Let's check the job titles we have filtered out to make sure we aren't missing any key words 

In [19]:
not_analyst_df['job_title'].unique()

array(['oracle data engineer', 'dba/ data engineer',
       'threat data engineer', 'data migration specialist',
       'ipsoft developer', 'data engineer', 'junior tableau developer',
       'technical lead - data engineering', 'netsuite developer',
       'azure data engineer',
       'sql server developer - remote - £70k / £80k',
       'data engineer remote', 'data warehouse engineer',
       'shopify developer',
       'devsecops engineer - security and data governance us remote',
       'freelance associate database programmer remote',
       'systems engineer', 'netsuite developer remote',
       'systems engineer - #665', 'data engineer - hatfield/remote',
       'data engineer  remote  68k - 103k eur + 12% bonus per annum',
       'multiple junior devs php laravel - remote - £35000 doe',
       'database engineer',
       'systems engineer – unsociable hours – remote - msp - £40-45k',
       'data engineer/ sql developer', 'rust developer remote',
       'golang developer  ins

It seems we haven't missed many analytics roles so we will move to the next role. 

In [20]:
# Filtering for scientist jobs 
scientist_words = ['science','scientist','scientiest','machine learning']
matches_regex = "|".join(scientist_words)
mask = df['job_title'].str.contains(matches_regex, regex=True)
scientist_df = df[mask].copy()
not_scientist_df = df[~mask]

scientist_df.tail()

Unnamed: 0,job_title,company_name,location,date_posted,applicant_count,job_description_lines,country,job
1496,data scientist,bitsight,boston ma,1 month ago,10.0,\n bitsight is looking for data s...,usa,scientist
1498,data scientist,pandadoc,united states,19 hours ago,143.0,\n your role as the data scientis...,usa,scientist
1499,data scientist,liberty mutual insurance,united states,4 days ago,123.0,\nadvance your data science career at liberty ...,usa,scientist
1501,full stack developer data scientist,asrc federal,united states,1 month ago,5.0,\n asrc federal is seeking a full...,usa,scientist
1513,machine learning engineer,fiscalnote,washington dc,1 week ago,7.0,\nabout this position\n\nat fiscalnote we buil...,usa,scientist


In [21]:
# Manually check to see if we have missed any roles
not_scientist_df['job_title'].unique()

array(['digital data analyst - remote',
       'work from home opportunity  data analyst', 'online data analyst',
       'data analyst', 'part-time online data analyst',
       'part-time job opportunity  data analyst',
       'data analyst - home working remote',
       'work from home opportunity in the uk as data analyst',
       'data analyst - uk mostly remote', 'data analyst- ni remote',
       'project data analyst', 'google data analyst',
       'revenue data analyst', 'game data analyst',
       'junior data analyst nix & kix - kickstarter *',
       'data analyst build infrastructure',
       'data quality analyst - 12 month contract - remote working',
       'data analyst inventory partnerships uk', 'product data analyst',
       'salesforce data analyst',
       'data analyst sql end user remote  £45k',
       'data analyst graduate career accelerator',
       'data insight analyst - green energy giant - london & remote -',
       'performance data analyst', 'data quality a

In [22]:
# Use the scientist df to split off the subset of junior positions
jr_words = ['jr', 'junior', 'intern']
matches_regex = "|".join(jr_words)
mask = scientist_df['job_title'].str.contains(matches_regex, regex=True)
jr_scientist_df = scientist_df[mask]
# .copy() so that assigning to the 'job' column later does not act on a slice
scientist_df = scientist_df[~mask].copy()

len(jr_scientist_df)

10
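
One caveat with plain substring matching is that 'intern' also matches words such as 'international' or 'internal', so the count above may include false positives. The sketch below compares substring matching with word-boundary matching on made-up titles; it illustrates the idea and is not the filter used in this notebook.

```python
import pandas as pd

# Made-up titles for illustration
titles = pd.Series([
    "junior data scientist",
    "data science intern",
    "international data scientist",
    "jr. machine learning engineer",
])

# Substring match, as in the cell above
substring_mask = titles.str.contains("jr|junior|intern", regex=True)

# \b anchors the key words to word boundaries; (?:...) is a non-capturing group
boundary_mask = titles.str.contains(r"\b(?:jr|junior|intern)\b", regex=True)

print(pd.DataFrame({"title": titles,
                    "substring": substring_mask,
                    "boundary": boundary_mask}))
```

With word boundaries, "international data scientist" is no longer flagged as a junior role.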

On closer inspection, only 10 posts include a junior key word in the title, which is too few for further analysis. After further exploratory analysis, the best way to find junior roles turned out to be the years of experience requested, as explained later in this notebook.

In [23]:
# Change new job values 
scientist_df['job'] = 'scientist'
analyst_df['job'] = 'analyst'

# Concat filtered df into new df 
df = pd.concat([scientist_df, analyst_df],ignore_index=True)
df.head()

Unnamed: 0,job_title,company_name,location,date_posted,applicant_count,job_description_lines,country,job
0,data engineer / machine learning engineer,streamba,glasgow scotland united kingdom,1 month ago,7.0,\n as a data engineer / machine l...,uk,scientist
1,data analyst support scientist – remote genomi...,hireresources,santa ana ca,3 days ago,5.0,\none of our valued clients a leading and grow...,usa,scientist
2,data scientist remote,yelp,glasgow scotland united kingdom,4 days ago,116.0,\n at yelp it’s our mission to co...,uk,scientist
3,data scientist - growth,spotify,london england united kingdom,2 weeks ago,,\n the freemium r&d team oversees...,uk,scientist
4,data scientist - growth strategy,spotify,london england united kingdom,2 weeks ago,,\n are you a talented data scient...,uk,scientist


Let's get an overview of our new data frame. 

In [24]:
df.describe(include='all')

Unnamed: 0,job_title,company_name,location,date_posted,applicant_count,job_description_lines,country,job
count,593,593,593,593,480.0,593,593,593
unique,340,412,161,31,,579,2,2
top,data scientist,dice,united states,1 month ago,,\nabout vsco\n\nat vsco our mission is to nurt...,usa,scientist
freq,85,50,211,131,,2,499,354
mean,,,,,46.44375,,,
std,,,,,48.390558,,,
min,,,,,2.0,,,
25%,,,,,9.0,,,
50%,,,,,27.0,,,
75%,,,,,65.5,,,
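
The overview shows 593 rows but only 579 unique descriptions. Duplicate descriptions were dropped before the split, so the 14 repeats must be posts whose title matched both key word lists (for example "data analyst support scientist ..." in the head output above), which puts them in both `scientist_df` and `analyst_df`. One way to avoid this is sketched below on made-up rows, under the assumption that a title matching both lists should count as a scientist role; that priority rule is a choice made for the example, not something decided in this notebook.

```python
import pandas as pd

# Made-up rows for illustration
jobs = pd.DataFrame({
    "job_title": ["data analyst", "data scientist",
                  "data analyst support scientist"],
    "job_description_lines": ["desc a", "desc b", "desc c"],
})

is_scientist = jobs["job_title"].str.contains("science|scientist|machine learning")
is_analyst = jobs["job_title"].str.contains("analytics|analyst|analysis")

scientist_df = jobs[is_scientist].assign(job="scientist")
# Assumption: a title matching both lists is kept as a scientist role only
analyst_df = jobs[is_analyst & ~is_scientist].assign(job="analyst")

combined = pd.concat([scientist_df, analyst_df], ignore_index=True)
print(combined)
```

Each post then appears once in the combined frame, so it is not counted twice in later key word analysis.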


## Adding experience  
Next we will extract the experience mentioned in the descriptions.
Most posts mention "x years" or "x years of experience".
We will find the word "experience" and return the digit that comes before it.
Regex 101 was used to test and find the best and fastest pattern to search.

In [25]:
# Find 'experience' in each 'job_description_lines' and capture the digit (or 'few') before it
pattern = re.compile(r"((\d)|(\d )|(few ))([A-Za-z0-9\'`+]+ )([A-Za-z0-9\+]+ )(?:[A-Za-z0-9\+]+ ){0,6}experience")
df['experience'] = df['job_description_lines'].apply(lambda x: re.findall(pattern, x))
df['experience']

0                                                     []
1      [(2 , , 2 , , years , of ), (1 , , 1 , , to , ...
2                        [(few , , , few , years , of )]
3                               [(3, 3, , , + , years )]
4                               [(3, 3, , , + , years )]
                             ...                        
588                             [(3, 3, , , + , years )]
589    [(2, 2, , , + , years ), (2, 2, , , + , years ...
590                             [(2, 2, , , + , years )]
591                           [(3 , , 3 , , or , more )]
592                             [(5, 5, , , + , years )]
Name: experience, Length: 593, dtype: object
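
To make the tuples above easier to read, here is the same pattern applied to three made-up sentences. Each tuple holds the six capture groups in order: the leading digit or 'few' (group 1, repeated in whichever of groups 2 to 4 matched) followed by the next two words. Note that the pattern captures a single digit, so a two-digit requirement such as "10 years" is read as "1".

```python
import re

pattern = re.compile(
    r"((\d)|(\d )|(few ))([A-Za-z0-9\'`+]+ )([A-Za-z0-9\+]+ )"
    r"(?:[A-Za-z0-9\+]+ ){0,6}experience"
)

# Made-up sentences for illustration
samples = [
    "3+ years of experience",
    "2 years of experience",
    "10 years of experience",
]
for text in samples:
    print(f"{text!r} -> {pattern.findall(text)}")
```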

Some things to note:
- Most posts ask for experience in years, but a few ask for something like 6 months of experience. Any entry in months is changed to 0 years.
- "few years" of experience is taken as 2.
- We keep just the leading digit so that the column is numeric.

In [26]:
    
def extract_string(matches):
    """Return the years of experience from a list of regex match tuples."""
    # If the list is empty, no numerical reference to experience was found
    if len(matches) == 0:
        return np.nan

    # Loop through the list of tuples and use the first one that mentions years or months
    for match in matches:
        # Strip white space from every captured string
        words = tuple(word.strip() for word in match)
        if ('years' in words) or ('year' in words) or ('yr' in words):
            # Some entries state 'a few years'; this is taken as 2
            return 2 if words[0] == 'few' else words[0]
        if ('month' in words) or ('months' in words) or ('months+' in words):
            # Experience given in months counts as 0 years
            return 0

    # None of the tuples mentioned years or months
    return np.nan


df['experience'] = df['experience'].apply(extract_string)
df['experience'] = pd.to_numeric(df['experience'], errors='coerce')

df['experience'].unique()


array([nan,  2.,  3.,  5.,  7.,  4.,  1.,  8.,  6.,  0.])

The function uses several loops to capture all instances of experience. As the dataframe is small this is not a problem, but a large dataframe may need a different method.
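
For a larger dataframe, one possible alternative is a vectorised extraction with pandas string methods. The sketch below uses a simplified pattern that is not equivalent to the one in this notebook (it accepts multi-digit numbers and ignores months), so treat it as a starting point under those assumptions rather than a drop-in replacement. The sample descriptions are made up.

```python
import pandas as pd

# Made-up descriptions for illustration
descriptions = pd.Series([
    "we need 3+ years of experience with sql",
    "a few years of experience in analytics",
    "experience with python is a plus",
])

# One capture group: a number or 'few', followed by a years word and,
# within at most six further words, the word 'experience'
simple_pattern = r"(\d+|few)\+?\s+(?:years?|yrs?)(?:\s+[\w+]+){0,6}?\s+experience"

years = descriptions.str.extract(simple_pattern, expand=False)
years = years.str.replace("few", "2", regex=False)  # 'few' is taken as 2
years = pd.to_numeric(years, errors="coerce")
print(years)
```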

Now we will check the rows where no experience value was found, to see whether we missed anything.

In [27]:
df['experience'].isna().sum()

284

In [28]:
no_experience = df[df['experience'].isna()].copy()

In [29]:
pattern = re.compile(r"((\d)|(\d )|(few ))([A-Za-z0-9\'`+]+ )([A-Za-z0-9\+]+ )(?:[A-Za-z0-9\+]+ ){0,6}experience")
no_experience['experience'] = no_experience['job_description_lines'].apply(lambda x: re.findall(pattern, x))
no_experience['experience']

0                              []
5                              []
7                              []
8                              []
12                             []
                  ...            
579                            []
583                            []
585                            []
587                            []
591    [(3 , , 3 , , or , more )]
Name: experience, Length: 284, dtype: object

Looks like we've caught all the instances we can use.
To double-check, we search the no-experience rows for the word that appears just before "experience".

In [30]:
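# A repeated group keeps only its last match, so this returns the word immediately before 'experience'
pattern = re.compile(r"(\w+ ){3}experience")
experience_check = no_experience['job_description_lines'].apply(lambda x: re.findall(pattern, x))
experience_check.to_list()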

[['and '],
 [],
 [],
 [],
 ['international ', 'of ', 'of ', 'and ', 'backgrounds '],
 ['user ', 'practices ', 'have ', 'engines ', 'actions '],
 ['and ', 'or ', 'commercial ', 'and ', 'industry '],
 ['on ', 'frameworks ', 'datasets ', 'pytorch ', 'pyspark ', 'significant '],
 ['and '],
 ['customer ', 'employee ', 'that '],
 [],
 [],
 ['have '],
 ['prospecting ', 'for ', 'and ', 'your '],
 ['your '],
 ['backgrounds '],
 ['solid ', 'solid ', 'on ', 'good ', 'algorithms ', 'on '],
 ['an ', 'and ', 'professional '],
 [],
 ['cyclesextensive ', 'platformsextensive '],
 [],
 ['an ', 'sciences ', 'sciences ', 'sciences ', 'sciences ', 'wrangling '],
 [],
 ['customer '],
 ['advanced '],
 [],
 ['of '],
 ['building ', 'environmentdod ', 'on '],
 [],
 ['proven ', 'science ', 'personal '],
 ['work '],
 ['on ', 'on '],
 [],
 ['and ', 'relevant ', 'science ', 'insightdemonstrated ', 'on '],
 ['with ', 'product ', 'proven ', 'science ', 'of ', 'personal '],
 ['employee ', 'talent ', 'employee ', 'empl

All of these entries relate to experience with particular skills, which will be picked up when we look at key words.

In [31]:
df.to_csv(config.WRANGLED_DATA_FILE, index=False)
