# Data Wrangling with Lytics Profile Data - Tools and Techniques

The goal of this notebook is to present some tools and techniques that can be used to wrangle Industry Dive data. 

## What is Data Wrangling again?
>Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.  Some transformation techniques include: parsing, joining, standardizing, augmenting, cleansing, and consolidating. 

[per wikipedia](https://en.wikipedia.org/wiki/Data_wrangling)

## Bad Data in, Bad Data out

![bad data in bad data out](https://cdn-images-1.medium.com/max/1200/0*YCghEemt6BtW9OZV.png "Bad Data in Bad Data out")

Many websites use forms to collect information from users.  In our case, we have signup forms for dives that ask for information about our users, like so:

![signup form](../data/img/signup_form.png "signup form")

As you can see, there are fields that are restricted to pre-defined values (e.g., Job Function), and free-form fields (e.g., Company Name) where a user can type almost anything they like.  Whenever users are exposed to free-form fields, there is a possibility of bad/messy/non-standardized data making it into your system.

For example, here are some variants of "IKEA" that are present for user profiles that we have:

* IKEA
* IKEA AG
* IKEA Belgium
* IKEA Canada
* IKEA Danville
* IKEA Food
* IKEA Home Furnishings
* IKEA Portugal
* IKEA USA
* IKEA US EAST, LLC 215
* IKEA US

Without some wrangling, you would not be able to aggregate these folks properly into a single group based on company.
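
As a rough illustration of the problem, here is a minimal sketch using a handful of the variants listed above. The `company_key` column and the "keep only the first word" rule are assumptions made up for this example, not the method used later in this notebook.

```python
import pandas as pd

# A few of the IKEA variants listed above, plus one unrelated company.
names = pd.Series([
    "IKEA", "IKEA AG", "IKEA Belgium", "IKEA US EAST, LLC 215", "Walmart",
])

# Raw values: every variant ends up in its own group.
raw_counts = names.value_counts()

# Hypothetical normalization: lower-case and keep only the first word.
# This is deliberately crude; it would also merge unrelated companies
# that merely share a first word.
company_key = names.str.lower().str.split().str[0]
key_counts = company_key.value_counts()

print(raw_counts.max())    # 1 -> no raw value repeats
print(key_counts["ikea"])  # 4 -> all IKEA variants share one key
```

Grouping on the raw strings finds four separate "companies" of one person each, while grouping on the normalized key finds a single company of four.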

## Lytics Profile Data
Now, let's take a look at some Lytics profile data, which consists of all information we have about users who interact with our content.  Within this data, there are key demographic fields that can help us understand who our users are, such as:
* first and last name
* job title
* email domain
* company name
* address

The data file we are going to look at is an export of the "All" audience segment in Lytics.
https://activate.getlytics.com/audiences/4cc5d612f46fb86e5cfd0c995250e60c/summary?aid=2751

![All Audience segment in Lytics](../data/img/lytics_all_audience_segment.png "All Audience segment in Lytics")

Let's start looking at this data to see how we can clean it up in order to help us create more accurate statistics about our users.

In [15]:
import pandas as pd
import numpy as np

dtypes = {'company': 'str', 'company_name': 'str', 'domain': 'object', 'emaildomain': 'object', 'emaildomains': 'object',
         'st_profile_id': 'object', 'user_id': np.float64, 'lytics_segment': 'object'}
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
df = pd.read_csv('../data/files/lytics_profile_data_export.csv', sep=',', on_bad_lines='skip', index_col=False, dtype=dtypes)

# list columns in dataset
print(list(df))

# number of rows
print('# of rows left: %s' % df.shape[0])
# print(df[df['st_profile_id'].str.contains("5a2ba1f6ff530ac11a8b4868", na=False)])

['company', 'company_name', 'domain', 'emaildomain', 'emaildomains', 'st_profile_id', 'user_id', 'lytics_segment']
# of rows left: 782425


There are multiple fields in the data we could clean up, but first let's look at the "company_name" field.  One of the first things we should do is get rid of rows with company name values we don't care about.

In [16]:
# remove null company name values
df = df.dropna(subset=['company_name'])

# number of rows
print('# of rows left: %s' % df.shape[0])

# of rows left: 458289


In [17]:
# find values that are any combination of special characters
special_char_values = df['company_name'].str.contains("^[!@#$%^&*(),.?]*$", na=False)
print(df[special_char_values].company_name.unique())

# number of rows
print('# of special character value rows: %s' % df[special_char_values].shape[0])
df = df[~special_char_values]

print('# of rows left: %s' % df.shape[0])
# print(df[df['st_profile_id'].str.contains("5a2ba1f6ff530ac11a8b4868", na=False)])

['..' '.' '...' '*' '********' '......' ',' '.....' '***' '????????' '?'
 '**' '.......' ',,' '@@']
# of special character value rows: 103
# of rows left: 458186
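
One detail worth knowing about the pattern used above: because it ends in `*` (zero or more), it also matches empty strings. The sketch below compares it with a `+` variant (one or more) on a made-up sample; the sample values and variable names are assumptions for illustration only.

```python
import pandas as pd

# Made-up sample values, including an empty string.
sample = pd.Series(["...", "?", "@@", "", "A.B. Corp", "IKEA"])

pattern_star = r"^[!@#$%^&*(),.?]*$"  # pattern used in the cell above
pattern_plus = r"^[!@#$%^&*(),.?]+$"  # variant requiring at least one character

matches_star = sample.str.contains(pattern_star, na=False)
matches_plus = sample.str.contains(pattern_plus, na=False)

print(matches_star.tolist())  # [True, True, True, True, False, False]
print(matches_plus.tolist())  # [True, True, True, False, False, False]
```

Whether empty strings should be dropped at this step is a judgment call. The same consideration applies to any other pattern of the form `^[...]*$`.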


In [18]:
# find values that are only numbers
number_values = df['company_name'].str.contains("^[0-9]*$", na=False)
print(df[number_values].company_name.unique())

# number of rows
print('# of number value rows: %s' % df[number_values].shape[0])
df = df[~number_values]

print('# of rows left: %s' % df.shape[0])

['1948' '1989' '1954' '451' '1957' '1979' '252' '1953' '1967' '8020'
 '1960' '5' '104' '1999' '123' '1974' '1988' '1977' '1000' '900' '1956'
 '605' '8760' '1984' '1959' '1998' '1972' '1992' '1997' '1991' '111'
 '1990' '1987' '1970' '1969' '1965' '1968' '1995' '1993' '1975' '1963'
 '231112027' '53' '1976' '1985' '1949' '149' '0' '1971' '1986' '346'
 '47723' '1947' '94122202312' '1' '1958' '1973' '43' '1935' '1961' '1994'
 '1946' '325024080134' '1996' '1982' '15' '34' '1952' '271' '1980' '1966'
 '1936' '47' '1978' '1964' '1928' '50' '2714' '1955' '1690' '1942' '13'
 '05358359981' '9172077326' '12' '151' '1951' '2000' '400000000000' '2'
 '1905' '2020' '1940' '1983' '2008' '198' '2013' '1962' '411' '2015' '295'
 '1950' '940005848995' '11455' '83255804' '2166833' '1001' '6' '91957'
 '14' '887000000000' '666' '59' '963' '32000' '555' '404' '0789243438'
 '438' '68' '1945' '525' '825' '2009' '1981' '8001504151' '136' '359'
 '365' '308' '940003979987' '6164381822' '1107' '0673282495' '2040' '74

In [19]:
# random additional values that I found when I was looking at the data in Excel
weird_vals = ['#NAME?', '{Re}', '< self >']
weird_values = df['company_name'].isin(weird_vals)
df = df[~weird_values]

# left over rows in dataframe
print('# of rows left: %s' % df.shape[0])

# of rows left: 457462
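
The filters in the cells above can also be written as boolean masks that are combined into a single filter, which makes it easier to see why each row was dropped. This is only a sketch on a made-up `sample` frame; the mask names are hypothetical, and the patterns use `+` rather than `*` so that empty strings are not flagged (an assumption that differs from the cells above).

```python
import pandas as pd

# Made-up sample covering each kind of bad value seen above.
sample = pd.DataFrame(
    {"company_name": ["IKEA", None, "...", "1989", "#NAME?", "Walmart"]}
)
names = sample["company_name"]

is_null = names.isna()
is_special = names.str.contains(r"^[!@#$%^&*(),.?]+$", na=False)
is_number = names.str.contains(r"^[0-9]+$", na=False)
is_weird = names.isin(["#NAME?", "{Re}", "< self >"])

# A row is bad if any single check flags it.
bad = is_null | is_special | is_number | is_weird
cleaned = sample[~bad]

print(bad.sum())                         # 4
print(cleaned["company_name"].tolist())  # ['IKEA', 'Walmart']
```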


Now that we have cleaned all the bad company name values from our dataset, let's work on standardizing the names to help with comparison.

In [20]:
# change the values to all lower case
df['stndrdzed_company_name'] = df['company_name'].str.lower()
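# remove all punctuation (regex=True is required in pandas 2.0+, where str.replace defaults to literal matching)
df["stndrdzed_company_name"] = df['stndrdzed_company_name'].str.replace(r'[^\w\s]', '', regex=True)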

# remove rows with "none" as value (exact match, so names that merely contain "none", such as "danone", are kept)
none_rows = df['stndrdzed_company_name'].str.strip() == 'none'
df = df[~none_rows]

# remove rows with "" as value
empty_string_rows = df['stndrdzed_company_name'].values == ''
df = df[~empty_string_rows]
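
Since the same standardization has to be applied to every dataset we want to compare, it can help to wrap it in a function. The following is a minimal sketch: `standardize_names` is a hypothetical helper, and the whitespace-collapsing step is an addition that the cell above does not perform.

```python
import pandas as pd

def standardize_names(names: pd.Series) -> pd.Series:
    """Lower-case, strip punctuation, and collapse whitespace."""
    cleaned = names.str.lower()
    # Drop everything that is not a word character or whitespace.
    cleaned = cleaned.str.replace(r"[^\w\s]", "", regex=True)
    # Extra step (not in the cell above): collapse runs of whitespace.
    cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()
    return cleaned

sample = pd.Series(["Macy's", "A-Dec", "  Duke   Energy ", "IKEA US EAST, LLC 215"])
print(standardize_names(sample).tolist())
# ['macys', 'adec', 'duke energy', 'ikea us east llc 215']
```

Using one function for both sides of a comparison guarantees that the two name columns are normalized in exactly the same way.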

Let's take a look at our dataset to see what we are working with:

In [21]:
grouped = df.groupby('stndrdzed_company_name')

grouped = grouped.size().reset_index(name='counts')
grouped.sort_values(by=['counts'], ascending=False)

| | stndrdzed_company_name | counts |
|---|---|---|
| 449 | 20160506deleteme | 1364 |
| 214673 | self | 692 |
| 115522 | ibm | 545 |
| 263824 | walmart | 523 |
| 3715 | accenture | 512 |
| 147255 | macys | 444 |
| 230752 | student | 414 |
| 72555 | duke energy | 412 |
| 163621 | mr | 384 |
| 214692 | self employed | 379 |


One thing to note from looking at this is that some company names are not in English.  For instance, "현대엔지니어링" is Korean.  If you wanted to focus on English values, you could work on filtering these out as well.  I tried to use a library called "langdetect" for this, but it did not do a good job of picking up the obvious cases.
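
A much cruder alternative to language detection is to flag names that contain any non-ASCII character. To be clear, this is a character-set check rather than language detection, and it is only a sketch: the sample names and the `is_ascii_only` helper are made up for illustration.

```python
# Made-up sample: one Korean name, two plain ASCII names, one accented name.
names = ["현대엔지니어링", "ibm", "duke energy", "nestlé"]

def is_ascii_only(name: str) -> bool:
    """Return True if every character in the name is ASCII."""
    return name.isascii()

ascii_names = [n for n in names if is_ascii_only(n)]
non_ascii_names = [n for n in names if not is_ascii_only(n)]

print(ascii_names)      # ['ibm', 'duke energy']
print(non_ascii_names)  # ['현대엔지니어링', 'nestlé']
```

Note the limitation: "nestlé" is flagged too, so accented Latin names would need separate handling. On a DataFrame column with no null values, the same check could be applied with `.map(str.isascii)`.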

Now that we have wrangled the data a bit, we can try to enhance it with an external dataset.  DiscoverOrg, a dataset we recently bought rights to, has information about companies that could be useful for analysis.  The field the two datasets have in common is the company name, so we can load the DiscoverOrg data, clean it up a bit, and then match it against our cleaned dataset on company name.

In [34]:
dtypes= {'Company ID': np.int64, 'Company Name': 'str', 'Company Website': 'object', 'Company HQ Phone': 'object',
        'Company Email Domain': 'object', 'Company Description': 'object', 'Company Primary Industry': 'object',
        'Company Revenue': np.float64, 'Company IT Budget (Mil)': 'object', 'Number of Employees': np.int64,
        'Company IT Employees': np.float64, 'Company Fortune Rank': np.float64, 'Company Ownership': 'object', 'Company Profile URL': 'object',
        'Company Business Model (B2B/B2C/B2G)': 'object', 'Hospital Beds': 'object', 'HQ Address 1': 'object', 'HQ Address 2': 'object',
        'HQ City': 'object', 'HQ State': 'object', 'HQ Postal Code': 'object', 'HQ County': 'object', 'HQ Country': 'object'
        }
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
df2 = pd.read_csv('../data/files/DiscoverOrg_Company_223030_20180731141156.csv', encoding='latin-1', sep=',', on_bad_lines='skip', index_col=False, dtype=dtypes)

# change the values to all lower case
df2['stndrdzed_company_name'] = df2['Company Name'].astype(str).str.lower()
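# remove all punctuation (regex=True is required in pandas 2.0+, where str.replace defaults to literal matching)
df2["stndrdzed_company_name"] = df2['stndrdzed_company_name'].str.replace(r'[^\w\s]', '', regex=True)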
df2

Unnamed: 0,Company ID,Company Name,Company Website,Company HQ Phone,Company Email Domain,Company Description,Company Primary Industry,Company Revenue,Company IT Budget (Mil),Number of Employees,...,Company Business Model (B2B/B2C/B2G),Hospital Beds,HQ Address 1,HQ Address 2,HQ City,HQ State,HQ Postal Code,HQ County,HQ Country,stndrdzed_company_name
0,25321,1E,www.1e.com,+44-2083-263880,1e.com,1E is an IT Software company based in the Unit...,Computer Software,4.700000e+07,1.66,250,...,B2B/B2G,,CP House,97-107 Uxbridge Road,London,England,W5 5TL,"New York County, New York",United Kingdom,1e
1,2,24 Hour Fitness,www.24hourfitness.com,(925) 543-3100,24hourfit.com,"24 Hour Fitness USA, Inc. operates health club...",Leisure,1.420000e+09,33.8,22000,...,B2C,,12647 Alcosta Boulevard,Suite 500,San Ramon,CA,94583,"Contra Costa County, California",United States,24 hour fitness
2,4035320,3VR,www.3vr.com,(415) 495-5790,3vr.com,"Founded in 2002, 3VR is a video intelligence c...",Services,1.786000e+07,0,90,...,B2B,,814 Mission Street,Suite 400,San Francisco,CA,94103,"San Francisco County, California",United States,3vr
3,4040856,42West,www.42west.net,(212) 277-7555,42west.net,"With unparalleled experience, access, and judg...",Advertising / Marketing,1.000000e+07,0,138,...,B2B,,600 3rd Avenue,23rd Floor,New York,NY,10016,"New York County, New York",United States,42west
4,26455,4Com,www.4com.co.uk,+44-3304-444444,4com.co.uk,4com provide telecommunications services to co...,Telecom / Communication Services,5.121000e+07,2,257,...,B2B,,Loewy House,"11 Enterprise Way, Aviation Park West",Christchurch,England,BH23 6EW,,United Kingdom,4com
5,26367,84.51,www.8451.com,(513) 632-1020,8451.com,84.51 helps companies create sustainable growt...,Advertising / Marketing,1.100000e+07,1,500,...,B2B,,100 W 5th Street,,Cincinnati,OH,45202,"Hamilton County, Ohio",United States,8451
6,3,99 Cents Only Stores,www.99only.com,(323) 980-8145,99only.com,"Founded in 1982, 99 Cents Only Stores LLC is t...",Retail,2.194000e+09,40.5,17200,...,B2C,,4000 E Union Pacific Avenue,,City of Commerce,CA,90023,"Los Angeles County, California",United States,99 cents only stores
7,5,A-Dec,www.a-dec.com,(503) 538-9471,a-dec.com,A-Dec is a dental equipment manufacturing comp...,Medical Devices,2.950000e+08,10.92,1000,...,B2B/B2C/B2G,,2601 Crestview Drive,,Newberg,OR,97132,"Yamhill County, Oregon",United States,adec
8,10,AAR Corp,www.aarcorp.com,(630) 227-2000,aarcorp.com,"Founded in 1955, AAR provides a wide range of ...",Manufacturing - Durables,1.748000e+09,64.7,5000,...,B2B/B2G,,1100 N Wood Dale Road,,Wood Dale,IL,60191,"Dupage County, Illinois",United States,aar corp
9,11,Aaron's,www.aarons.com,(678) 402-3000,aarons.com,"Aaron Rents, Inc. is the leader in the rental,...",Services,3.384000e+09,159,11900,...,B2C,,400 Galleria Parkway SE,Suite 300,Atlanta,GA,30339,"Fulton County, Georgia",United States,aarons


In [35]:
# merge with DiscoverOrg data in order to find matches
merge = pd.merge(df, df2, how='inner', on=['stndrdzed_company_name'])

# number of rows in the DiscoverOrg dataset (use merge.shape[0] for the row count after merging)
df2.shape[0]

68735
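
An inner merge silently drops every profile that has no match, so it does not tell us how many profiles were matched versus left behind. One way to measure that is a left merge with `indicator=True`, sketched below on made-up frames. The `profiles` and `companies` names and the industry values are placeholders, not real data.

```python
import pandas as pd

# Made-up stand-ins for the cleaned profile data and the company data.
profiles = pd.DataFrame({
    "stndrdzed_company_name": ["ibm", "ibm", "walmart", "self employed"],
})
companies = pd.DataFrame({
    "stndrdzed_company_name": ["ibm", "walmart", "pfizer"],
    "Company Primary Industry": ["Industry A", "Industry B", "Industry C"],
})

# how="left" keeps every profile; indicator=True adds a "_merge" column
# whose value is "both" for matched rows and "left_only" for unmatched ones.
checked = pd.merge(profiles, companies, how="left",
                   on="stndrdzed_company_name", indicator=True)

matched = checked["_merge"] == "both"
match_rate = matched.mean()

print(int(matched.sum()), len(checked), match_rate)  # 3 4 0.75
```

The row count stays equal to the number of profiles only if each company name appears once on the right-hand side; duplicate names there would multiply the matching profile rows.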

In [37]:
# let's group the values again by company name to see what has been matched across datasets
groups = merge.groupby('stndrdzed_company_name')

groups = groups.size().reset_index(name='counts')
groups.sort_values(by=['counts'], ascending=False)

| | stndrdzed_company_name | counts |
|---|---|---|
| 8749 | ibm | 545 |
| 19404 | walmart | 523 |
| 217 | accenture | 512 |
| 10837 | macys | 444 |
| 15979 | siemens | 424 |
| 5640 | duke energy | 412 |
| 17285 | target | 378 |
| 12813 | novartis | 350 |
| 19448 | waste management | 335 |
| 13792 | pfizer | 324 |


In [38]:
merge

Unnamed: 0,company,company_name,domain,emaildomain,emaildomains,st_profile_id,user_id,lytics_segment,stndrdzed_company_name,Company ID,...,Company Profile URL,Company Business Model (B2B/B2C/B2G),Hospital Beds,HQ Address 1,HQ Address 2,HQ City,HQ State,HQ Postal Code,HQ County,HQ Country
0,,Ingredion,,,ingredion.com,,,All,ingredion,3355,...,https://go.discoverydb.com/eui/#/company/3355,B2B/B2C,,5 Westbrook Corporate Center,,Westchester,IL,60154,"Cook County, Illinois",United States
1,,Ingredion,ingredion.com,ingredion.com,,5bbfdb079c625f403c37563b,,All,ingredion,3355,...,https://go.discoverydb.com/eui/#/company/3355,B2B/B2C,,5 Westbrook Corporate Center,,Westchester,IL,60154,"Cook County, Illinois",United States
2,,Ingredion,ingredion.com,,,5bc7c98c95a7a10ca9387899,,All,ingredion,3355,...,https://go.discoverydb.com/eui/#/company/3355,B2B/B2C,,5 Westbrook Corporate Center,,Westchester,IL,60154,"Cook County, Illinois",United States
3,,Ingredion,ingredion.com,ingredion.com,,59d0fedc6e4adc98278b4cb2,,All,ingredion,3355,...,https://go.discoverydb.com/eui/#/company/3355,B2B/B2C,,5 Westbrook Corporate Center,,Westchester,IL,60154,"Cook County, Illinois",United States
4,,Ingredion,ingredion.com,ingredion.com,,58ee31bc15dd9627058ba97a,,All,ingredion,3355,...,https://go.discoverydb.com/eui/#/company/3355,B2B/B2C,,5 Westbrook Corporate Center,,Westchester,IL,60154,"Cook County, Illinois",United States
5,,Ingredion,ingredion.com,ingredion.com,ingredion.com,5b2299672ddf9c4e35528b20,,All,ingredion,3355,...,https://go.discoverydb.com/eui/#/company/3355,B2B/B2C,,5 Westbrook Corporate Center,,Westchester,IL,60154,"Cook County, Illinois",United States
6,,Ingredion,ingredion.com,ingredion.com,,5ace1a6595a7a11c6a121356,,All,ingredion,3355,...,https://go.discoverydb.com/eui/#/company/3355,B2B/B2C,,5 Westbrook Corporate Center,,Westchester,IL,60154,"Cook County, Illinois",United States
7,,Ingredion,ingredion.com,ingredion.com,ingredion.com,53bee9abdd52b8b35e0078d6,,All,ingredion,3355,...,https://go.discoverydb.com/eui/#/company/3355,B2B/B2C,,5 Westbrook Corporate Center,,Westchester,IL,60154,"Cook County, Illinois",United States
8,,Ingredion,,,ingredion.com,,,All,ingredion,3355,...,https://go.discoverydb.com/eui/#/company/3355,B2B/B2C,,5 Westbrook Corporate Center,,Westchester,IL,60154,"Cook County, Illinois",United States
9,,Ingredion,ingredion.com,ingredion.com,,5a4b2b5266c379d7728b491c,,All,ingredion,3355,...,https://go.discoverydb.com/eui/#/company/3355,B2B/B2C,,5 Westbrook Corporate Center,,Westchester,IL,60154,"Cook County, Illinois",United States
