# Data Wrangling with Lytics Profile Data - Tools and Techniques

The goal of this notebook is to present some tools and techniques that can be used to wrangle Industry Dive data. 

## What is Data Wrangling again?
> Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.

[per Wikipedia](https://en.wikipedia.org/wiki/Data_wrangling)

## Bad Data in, Bad Data out

![bad data in bad data out](https://cdn-images-1.medium.com/max/1200/0*YCghEemt6BtW9OZV.png "Bad Data in Bad Data out")

Many websites use forms to collect information from users.  In our case, the signup forms for our dives ask for information about our users, like so:

![signup form](../data/img/signup_form.png "signup form")

As you can see, there are fields that are restricted to pre-defined values (e.g., Job Function), and free-form fields (e.g., Company Name) where a user can type almost anything they like.  Whenever users are exposed to free-form fields, there is a possibility of bad/messy/non-standardized data making it into your system.  For example, let's say two people sign up for Retail Dive and enter the following in the "Company Name" field of their signup forms:
* User 1 - Company Name: "Walmart"
* User 2 - Company Name: "Wal-mart, Inc."

While we can see at a glance that these two values are spelled differently yet refer to the same company, if we attempted to count our users per company, these two users would not end up in the same group.
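To make the problem concrete, here is a minimal sketch using the two sign-ups from the example above. The DataFrame and its column names (`user`, `company_name`) are made up for illustration and are not the real export schema.

```python
import pandas as pd

# Toy data: the two sign-ups from the example above.
signups = pd.DataFrame({
    "user": ["User 1", "User 2"],
    "company_name": ["Walmart", "Wal-mart, Inc."],
})

# Counting on the raw strings treats each spelling as its own company.
raw_counts = signups.groupby("company_name").size()
print(raw_counts)
```

The result has two groups of one user each, rather than one group of two.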



Some transformation techniques include: parsing, joining, standardizing, augmenting, cleansing, and consolidating. 
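As one illustration of standardizing, the sketch below reduces a company name to a simplified grouping key. The function name `standardize_company_name` and the `LEGAL_SUFFIXES` list are assumptions made for this example, not rules taken from our data; real company names will likely need more careful handling.

```python
import re

# Hypothetical list of legal suffixes to drop; extend it for real data.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "ltd", "corp", "corporation", "co"}

def standardize_company_name(name: str) -> str:
    """Return a simplified key for grouping company names (illustrative only)."""
    key = name.lower()
    key = key.replace("-", "")              # "wal-mart" -> "walmart"
    key = re.sub(r"[^a-z0-9\s]", " ", key)  # any other punctuation becomes a space
    tokens = [t for t in key.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)

print(standardize_company_name("Walmart"))
print(standardize_company_name("Wal-mart, Inc."))
```

Both calls produce the same key, so grouping on the key instead of the raw value would put the two users from the earlier example in one group. The trade-off is that an aggressive key can also merge companies that really are different, so the suffix list deserves review before it is trusted.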




## Lytics Profile Data
We now use Lytics in order to house all data we know about users who interact with our content.  This data comes from many systems, but regardless of source, there are certain demographic fields in this dataset that can help us understand who our users are, such as:
* first and last name
* job title
* email
* company name
* address

The data file being used for this notebook is an export of the "All" audience segment in Lytics.
https://activate.getlytics.com/audiences/4cc5d612f46fb86e5cfd0c995250e60c/summary?aid=2751

![All Audience segment in Lytics](../data/img/lytics_all_audience_segment.png "All Audience segment in Lytics")

## Use Cases

### Company Name
First, we will look at some techniques to apply to our dataset based on a recent request from Audience Dev.  They would like to create aggregate statistics about our users based on company name, so this will be the basis upon which we will transform our data.

That being said, the export file size (782,426 rows/50.7 MB) made my RAM unhappy, so I decided to cut the file down based on the above-stated use case.  First, I removed all rows from the file which had a blank Company Name.  Next, I removed some obvious bad data (e.g., "*", "



Unfortunately, producing these per-company statistics is not as easy as counting by the company name field in the data.
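The trimming described above (dropping rows with a blank company name, then dropping obvious junk values) might look roughly like the sketch below. It is not the exact procedure used to produce the sample file. The helper name `drop_unusable_company_rows` is hypothetical, and `"*"` is the only junk value known from the text above; the sample rows are made up.

```python
import pandas as pd

# "*" is the only junk value named above; anything else added here is a guess.
JUNK_VALUES = {"*"}

def drop_unusable_company_rows(df: pd.DataFrame, column: str = "company_name") -> pd.DataFrame:
    """Keep only rows whose company name is non-blank and not a known junk value."""
    names = df[column].fillna("").astype(str).str.strip()
    keep = (names != "") & ~names.isin(JUNK_VALUES)
    return df[keep]

# Made-up rows covering the cases: valid, missing, whitespace only, junk, valid.
sample = pd.DataFrame({"company_name": ["Walmart", None, "   ", "*", "Wal-mart, Inc."]})
cleaned = drop_unusable_company_rows(sample)
print(cleaned)
```

Only the two rows with real company names survive the filter.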





In [None]:
import pandas as pd

df_lytics = pd.read_csv('lytics_profiles_export_comp_name_sample.csv', encoding='latin-1')

In [None]:
df_lytics.shape

In [None]:
df_disc_org = pd.read_csv('DiscoverOrg_Company_223030_20180731141156.csv', encoding='latin-1')

In [None]:
df_lytics.head()

In [None]:
df_disc_org.count()

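In [None]:
# Rename the DiscoverOrg columns so that company_name lines up with the Lytics column used in the join (assumes the file has exactly these five columns, in this order)
df_disc_org.columns = ['company_id', 'company_name', 'domain','company_primary_industry','hq_country']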

In [None]:
list(df_disc_org)

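In [None]:
# Inner join on exact company_name matches; rows whose spelling differs between the two files are dropped
df_merged = pd.merge(df_lytics, df_disc_org, how='inner', on='company_name')

In [None]:
df_merged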
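Because the inner join above only keeps exact matches on `company_name`, it can be useful to see how many Lytics rows fail to match at all. The sketch below uses small made-up stand-ins for `df_lytics` and `df_disc_org` (the emails and the company id are invented) and relies on the `indicator` option of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for df_lytics and df_disc_org; the rows are made up.
lytics = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "company_name": ["Walmart", "Wal-mart, Inc."],
})
disc_org = pd.DataFrame({
    "company_id": [1],
    "company_name": ["Walmart"],
})

# A left join keeps every Lytics row; indicator=True adds a "_merge" column
# saying whether the row found a match ("both") or not ("left_only").
checked = pd.merge(lytics, disc_org, how="left", on="company_name", indicator=True)
match_counts = checked["_merge"].value_counts()
print(match_counts)
```

Here one row matches and one does not, which is the same "Walmart" versus "Wal-mart, Inc." problem showing up in the join. Running the same check on the real frames would give a rough sense of how much standardizing is needed before the merge is useful.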