# Data Wrangling with Lytics Profile Data - Tools and Techniques

The goal of this notebook is to present some tools and techniques that can be used to wrangle Industry Dive data. 

## What is Data Wrangling again?
>Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.  Some transformation techniques include: parsing, joining, standardizing, augmenting, cleansing, and consolidating. 

[per Wikipedia](https://en.wikipedia.org/wiki/Data_wrangling)

## Bad Data in, Bad Data out

![bad data in bad data out](https://cdn-images-1.medium.com/max/1200/0*YCghEemt6BtW9OZV.png "Bad Data in Bad Data out")

Many websites use forms to collect information from users.  In our case, the signup forms for dives ask for information about our users, like so:

![signup form](../data/img/signup_form.png "signup form")

As you can see, there are fields that are restricted to pre-defined values (e.g., Job Function), and free-form fields (e.g., Company Name) where a user can type almost anything they like.  Whenever users are given free-form fields, bad, messy, or non-standardized data can make it into your system.

For example, here are some variants of "IKEA" that are present for user profiles that we have:

* IKEA
* IKEA AG
* IKEA Belgium
* IKEA Canada
* IKEA Danville
* IKEA Food
* IKEA Home Furnishings
* IKEA Portugal
* IKEA USA
* IKEA US EAST, LLC 215
* IKEA US

Without some wrangling, you would not be able to aggregate these folks properly into a single group based on company.
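
One way to collapse such variants is to map each raw name to a canonical key before grouping. The sketch below is a minimal illustration, not the approach used in this notebook: `KNOWN_COMPANIES` and `canonical_company` are hypothetical names, and matching on the leading word is a deliberately crude assumption that would mis-group unrelated companies sharing a first word.

```python
from collections import Counter

# Hypothetical lookup of canonical company names (an assumption for illustration).
KNOWN_COMPANIES = {"ikea": "IKEA"}


def canonical_company(raw_name):
    """Map a raw company name to a canonical name using its first word."""
    words = raw_name.strip().split()
    if not words:
        return raw_name
    # Fall back to the trimmed original when the first word is not recognized.
    return KNOWN_COMPANIES.get(words[0].lower(), raw_name.strip())


variants = [
    "IKEA", "IKEA AG", "IKEA Belgium", "IKEA Canada", "IKEA Danville",
    "IKEA Food", "IKEA Home Furnishings", "IKEA Portugal", "IKEA USA",
    "IKEA US EAST, LLC 215", "IKEA US",
]

# Count how many raw variants land in each canonical group.
counts = Counter(canonical_company(name) for name in variants)
print(counts)
```

In practice the lookup would need to be curated or derived from a reference dataset, since a first-word rule alone cannot tell "IKEA Food" from an unrelated company whose name happens to start the same way.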

## Lytics Profile Data
We now use Lytics in order to house all data we know about users who interact with our content.  This data comes from many systems, but regardless of source, there are certain demographic fields in this dataset that can help us understand who our users are, such as:
* first and last name
* job title
* email domain
* company name
* address

The data file being used for this notebook is an export of the "All" audience segment in Lytics.
https://activate.getlytics.com/audiences/4cc5d612f46fb86e5cfd0c995250e60c/summary?aid=2751

![All Audience segment in Lytics](../data/img/lytics_all_audience_segment.png "All Audience segment in Lytics")

First, we will load our data file:

In [398]:
import pandas as pd

df = pd.read_csv('../data/files/lytics_profile_data_export.csv', encoding='latin-1')

# list columns in dataset
print(list(df))

# number of rows
df.shape[0]

['company', 'company_name', 'domain', 'emaildomain', 'emaildomains', 'st_profile_id', 'user_id', 'lytics_segment']


782425

There are multiple fields in the data we could clean up, but first let's look at the "company_name" field.

In [399]:
# remove null company name values
df = df.dropna(subset=['company_name'])

# number of rows
df.shape[0]

458289

In [400]:
# find values made up only of special characters
# ("+" rather than "*" so that empty strings are not matched)
special_char_values = df['company_name'].str.contains(r"^[!@#$%^&*(),.?]+$", na=False)
print(df[special_char_values].company_name.unique())
# number of rows before these values are removed
print(df.shape[0])
df = df[~special_char_values]

['..' '.' '...' '*' '********' '......' ',' '.....' '***' '????????' '?'
 '**' '.......' ',,' '@@']
458289


In [401]:
# find values that are only digits
# ("+" rather than "*" so that empty strings are not matched)
number_values = df['company_name'].str.contains(r"^[0-9]+$", na=False)
print(df[number_values].company_name.unique())
# number of rows that are only digits
print(df[number_values].shape[0])
df = df[~number_values]

['1948' '1989' '1954' '451' '1957' '1979' '252' '1953' '1967' '8020'
 '1960' '5' '104' '1999' '123' '1974' '1988' '1977' '1000' '900' '1956'
 '605' '8760' '1984' '1959' '1998' '1972' '1992' '1997' '1991' '111'
 '1990' '1987' '1970' '1969' '1965' '1968' '1995' '1993' '1975' '1963'
 '231112027' '53' '1976' '1985' '1949' '149' '0' '1971' '1986' '346'
 '47723' '1947' '94122202312' '1' '1958' '1973' '43' '1935' '1961' '1994'
 '1946' '325024080134' '1996' '1982' '15' '34' '1952' '271' '1980' '1966'
 '1936' '47' '1978' '1964' '1928' '50' '2714' '1955' '1690' '1942' '13'
 '05358359981' '9172077326' '12' '151' '1951' '2000' '400000000000' '2'
 '1905' '2020' '1940' '1983' '2008' '198' '2013' '1962' '411' '2015' '295'
 '1950' '940005848995' '11455' '83255804' '2166833' '1001' '6' '91957'
 '14' '887000000000' '666' '59' '963' '32000' '555' '404' '0789243438'
 '438' '68' '1945' '525' '825' '2009' '1981' '8001504151' '136' '359'
 '365' '308' '940003979987' '6164381822' '1107' '0673282495' '2040' '74

In [403]:
# random additional values that I found when I was looking at the data in Excel
weird_vals = ['#NAME?', '{Re}', '< self >']
weird_values = df['company_name'].isin(weird_vals)
df = df[~weird_values]
# rows remaining in the DataFrame
print(df.shape[0])

457462
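
The three filters above can be combined into a single predicate, which makes the cleaning rules easier to test and reuse. This is a sketch under assumptions: `is_junk_company_name` is a hypothetical helper, written in plain Python with `re` so that it runs without the export file. On the DataFrame it could be applied with something like `df['company_name'].map(is_junk_company_name)`.

```python
import re

# Values made up only of punctuation/special characters, e.g. "..." or "***".
SPECIAL_ONLY = re.compile(r"[!@#$%^&*(),.?]+")
# Values made up only of digits, e.g. "1948".
DIGITS_ONLY = re.compile(r"[0-9]+")
# One-off placeholder values spotted by hand.
WEIRD_VALUES = {"#NAME?", "{Re}", "< self >"}


def is_junk_company_name(value):
    """Return True when a company name carries no usable information."""
    if value is None:
        return True
    text = str(value)
    return (
        text in WEIRD_VALUES
        or SPECIAL_ONLY.fullmatch(text) is not None
        or DIGITS_ONLY.fullmatch(text) is not None
    )


samples = ["IKEA", "...", "1948", "#NAME?", "3M", "????????"]
kept = [name for name in samples if not is_junk_company_name(name)]
print(kept)
```

Using `fullmatch` means the whole value must consist of the unwanted characters, so names that merely contain digits or punctuation (such as "3M") are kept.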


## Use Cases

### Company Name
First, we will look at some techniques to apply to our dataset based on a recent request from Audience Dev.  They would like to create aggregate statistics about our users based on company name, so this request will be the basis for how we transform our data.

The Lytics export file (3,488,529 rows/50.7 MB) made my RAM unhappy, so I decided to cut the file down based on the use case stated above.

First, I removed all rows from the file that had a blank Company Name.  Next, I removed some obvious bad data (e.g., "*", "11").
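
If the full export does not fit comfortably in memory, one option is to read it in chunks and keep only the columns and rows needed for the use case. The sketch below assumes pandas is installed and uses a small in-memory CSV in place of the real export. The column names mirror the ones listed earlier, while `load_company_rows`, the sample rows, and the chunk sizes are hypothetical.

```python
import io

import pandas as pd


def load_company_rows(source, chunksize=100_000):
    """Read a CSV in chunks, keeping only rows with a company name."""
    kept_chunks = []
    reader = pd.read_csv(
        source,
        usecols=["company_name", "emaildomain"],  # load only what is needed
        chunksize=chunksize,                      # rows per chunk
    )
    for chunk in reader:
        # Drop blank company names before the chunk is retained.
        kept_chunks.append(chunk.dropna(subset=["company_name"]))
    return pd.concat(kept_chunks, ignore_index=True)


# Stand-in for the real export file (made-up rows, for illustration only).
sample_csv = io.StringIO(
    "company_name,emaildomain,user_id\n"
    "IKEA,ikea.com,1\n"
    ",example.com,2\n"
    "IKEA USA,ikea.com,3\n"
)

# A tiny chunk size is used here only so the sample spans more than one chunk.
small_df = load_company_rows(sample_csv, chunksize=2)
print(small_df.shape[0])
```

Filtering inside the loop keeps peak memory close to the size of one chunk plus the rows retained so far, rather than the size of the whole file.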

In [None]:
df_disc_org = pd.read_csv('DiscoverOrg_Company_223030_20180731141156.csv', encoding='latin-1')
df_disc_org.columns = ['company_id', 'company_name', 'domain','company_primary_industry','hq_country']

In [None]:
# merge with DiscoverOrg data in order to find matches
# (`df` is the cleaned Lytics DataFrame built above)
merge = pd.merge(df, df_disc_org, how='inner', on=['company_name'])
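
An exact match on `company_name` will miss rows that differ only in case or surrounding whitespace. Here is a minimal sketch of one mitigation, assuming pandas is installed: build a normalized join key on both sides and pass `indicator=True` to see how many rows matched. The two small frames, the `add_company_key` helper, and the `company_key` column are made up for illustration and are not the real Lytics or DiscoverOrg data.

```python
import pandas as pd

# Made-up stand-ins for the Lytics and DiscoverOrg frames.
lytics = pd.DataFrame({"company_name": ["IKEA", " ikea ", "Acme Corp"]})
disc_org = pd.DataFrame(
    {"company_name": ["Ikea"], "company_primary_industry": ["Retail"]}
)


def add_company_key(frame):
    """Return a copy with a lowercase, whitespace-trimmed join key."""
    out = frame.copy()
    out["company_key"] = out["company_name"].str.strip().str.lower()
    return out


matched = pd.merge(
    add_company_key(lytics),
    add_company_key(disc_org),
    how="left",                 # keep unmatched Lytics rows for inspection
    on="company_key",
    suffixes=("_lytics", "_disc_org"),
    indicator=True,             # adds a `_merge` column: both / left_only
)

print(matched["_merge"].value_counts())
```

A left join with the indicator column makes the unmatched rows visible, which helps decide whether further normalization is worth the effort before switching back to an inner join.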