# Handle Basic Problems

Once we have data, we have to figure out whether it's usable yet.  Specifically, we may need to:

* Handle Missing Data
* Provide Default Values
* Impute Values
* Detect Outliers
* Filter Outliers
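
Here is a rough sketch of what those five steps can look like on a tiny, made-up Series. The data, the `lines_inserted` name, the median imputation, and the 1.5 × IQR outlier rule are all my assumptions for illustration; they are not taken from the data frames loaded later, and other rules may suit your data better.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one obvious outlier.
values = pd.Series([10.0, 12.0, np.nan, 11.0, 13.0, 95.0], name="lines_inserted")

# Handle missing data: count the gaps first.
n_missing = values.isna().sum()

# Provide a default value: replace gaps with a fixed constant.
with_default = values.fillna(0.0)

# Impute a value: replace gaps with the median of the observed values.
imputed = values.fillna(values.median())

# Detect outliers: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = imputed.quantile(0.25), imputed.quantile(0.75)
iqr = q3 - q1
is_outlier = (imputed < q1 - 1.5 * iqr) | (imputed > q3 + 1.5 * iqr)

# Filter outliers: keep only the rows that were not flagged.
filtered = imputed[~is_outlier]
```

Whether a default or an imputed value is the better choice depends on what the column means, so treat the constants above as placeholders.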


## Default Dependencies (Again)

These are the same dependencies I typically use, imported quickly and a bit sloppily.

In [1]:
%matplotlib inline
from IPython.core.pylabtools import figsize
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
plt.style.use('ggplot')
figsize(11,9)

import scipy.stats as stats

import pymc as pm

In [2]:
import requests

from os.path import join, dirname
import os
from dotenv import load_dotenv, find_dotenv

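# __file__ is not defined in a notebook, so dirname('__file__') returns ''
# and this resolves to '.env' in the current working directory.
dotenv_path = join(dirname('__file__'), '.env')
load_dotenv(dotenv_path)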

API_KEY = os.environ.get("API_KEY")

Also, here are some data frames, pre-loaded.

In [3]:
git_logs_filename = 'data/popular_open_source_logs.csv'
columns = ['timestamp', 'project', 'email', 'lines_inserted', 'lines_removed']
git_logs = pd.read_csv(git_logs_filename, index_col='timestamp', usecols=columns)

posts_filename = "data/posts-2016-06-08-21-35-42.csv"
columns = ['Author', 'Time', 'Text', 'ProfileUrl', 'PostUrl', 'Lang',
           'Sentiment']
posts = pd.read_csv(posts_filename, parse_dates=['Time'], index_col='Time', usecols=columns)
posts['HourOfDay'] = posts.index.hour
posts['DayOfWeek'] = posts.index.dayofweek

alternate_posts_filename = "data/alt_posts-2016-06-06.csv"
alternate_posts = pd.read_csv(alternate_posts_filename)

# This is a County Business Patterns API endpoint
url = "http://api.census.gov/data/2014/cbp?key=%s&get=EMP,ESTAB,EMPSZES,EMPSZES_TTL,PAYANN&for=state:*" % (API_KEY)
result = requests.get(url)
cbp = None
if result.ok:
    # The first row of the JSON payload holds the column names.
    data = result.json()
    cbp = pd.DataFrame(data[1:], columns=data[0])
print(result.reason)

OK


At this point, we have 4 data frames:

* git_logs: A history of commit activity for 10 popular open source projects
* posts: Some social media post data
* alternate_posts: Some social media posts from another source, covering the same timeframe
* cbp: County Business Patterns data from the US Census

## Handle Missing Data

By default, `pd.read_csv` skips blank lines, so you shouldn't have to deal with those much.

What you do have to deal with is missing values inside a column of data.

There is a more complete discussion of this in the [Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) if you need to go deeper on these issues.
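
Before deciding what to do about missing values, it helps to see where they are. This is a minimal sketch on a made-up three-row CSV; the rows are hypothetical and only borrow a few column names from `alternate_posts` so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for a messy export.
raw = io.StringIO(
    "CreatedTime,SenderUserId,MediaTypeList,Domain\n"
    "2016-06-06 13:48:10,,,worldpharmanews.com\n"
    "2016-06-06 13:47:14,42,,news.scoopasia.com\n"
    "2016-06-06 13:42:52,,image,quotenet.com\n"
)
sample = pd.read_csv(raw, index_col="CreatedTime", parse_dates=["CreatedTime"])

# Count the missing values in each column.
missing_per_column = sample.isna().sum()

# Keep only the rows that have a known SenderUserId.
known_senders = sample.dropna(subset=["SenderUserId"])
```

Printing `missing_per_column` gives a quick per-column tally, which is usually enough to decide whether to drop rows, fill in a default, or leave the gaps alone.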

Let's take a look at `alternate_posts` again.  I left it messy on purpose.  To figure out what's in there, I looked at each column for any values at all and kept every column that had some.  That's where the `columns` variable gets its values.

The only data type that I wanted to convert was the index, `CreatedTime`.  If I had wanted to convert other types, I would use a syntax like:

    dtypes = {'SenderUserId': np.int32}
    pd.read_csv(filename, dtype=dtypes, ...)

As it is, there **could** be a couple of columns to convert: `MediaTypeList` and `SenderUserId`.  I fill their missing values with the string `Unknown`, but only because other columns already use `Unknown` when a value is missing.  Typically I'd leave a missing value as `np.nan`.  There are benefits to having an easily identifiable missing value, but as long as I know what's there, I'm OK for now.
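
To make that trade-off concrete, here is a small sketch comparing three ways of handling a missing ID. The CSV text is made up, and the nullable `"Int64"` dtype is an option I am adding for comparison; it is not something this notebook uses. As far as I know, asking for `np.int32` on a column with gaps makes `read_csv` raise an error, which is why the sketch avoids it.

```python
import io
import pandas as pd

# Hypothetical data: the second row has no sender and no media type.
csv_text = "SenderUserId,MediaTypeList\n101,photo\n,\n103,video\n"

# Leaving gaps as NaN: the ID column is parsed as float64.
as_nan = pd.read_csv(io.StringIO(csv_text))

# Nullable integers: whole-number IDs can sit next to a missing value.
as_nullable = pd.read_csv(io.StringIO(csv_text), dtype={"SenderUserId": "Int64"})

# Sentinel string: easy to spot, but the ID column becomes object dtype.
as_sentinel = as_nan.assign(
    SenderUserId=as_nan["SenderUserId"].astype(object).fillna("Unknown"),
    MediaTypeList=as_nan["MediaTypeList"].fillna("Unknown"),
)
```

The sentinel version is the easiest to read in a table, while the NaN and nullable versions keep the column numeric for later calculations.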

In [91]:
alternate_posts_filename = "data/alt_posts-2016-06-06.csv"
columns = ['UniversalMessageId', 'SenderUserId', 'Title', 'Message',
           'CreatedTime', 'Language', 'LanguageCode', 'CountryCode',
           'MediaTypeList', 'Permalink', 'Domain', 'Spam', 'Action Time', 'Location']
alternate_posts = pd.read_csv(alternate_posts_filename,
                              usecols=columns,
                              index_col='CreatedTime',
                              parse_dates=['CreatedTime'])
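# Assign the result back instead of calling fillna(inplace=True) on an
# attribute, which may not update the frame in newer versions of Pandas.
alternate_posts['MediaTypeList'] = alternate_posts['MediaTypeList'].fillna('Unknown')
alternate_posts['SenderUserId'] = alternate_posts['SenderUserId'].fillna('Unknown')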
alternate_posts.head()

| CreatedTime | UniversalMessageId | SenderUserId | Title | Message | Language | LanguageCode | CountryCode | MediaTypeList | Permalink | Domain | Spam | Action Time | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-06-06 13:48:10 | WEB_115_sg_57557f1a111 | Unknown | Pfizer presents promising data from next gener... | Pfizer presents promising data from next gener... | English | en | DE | Unknown | http://www.worldpharmanews.com/pfizer/3504-pfi... | worldpharmanews.com | False | Mon Jun 06 13:48:10 GMT 2016 | Germany |
| 2016-06-06 13:47:14 | WEB_115_sg_57557ee2a7 | Unknown | Research on Osteoporosis Drugs Market Reveals... | Press Releases\nResearch on Osteoporosis Drugs... | English | en | UN | Unknown | http://news.scoopasia.com/index.php/news/resea... | news.scoopasia.com | False | Mon Jun 06 13:47:14 GMT 2016 | Unknown |
| 2016-06-06 13:42:52 | WEB_115_sg_57557ddcab | Unknown | Pfizer Says Pivotal Avelumab Study Shows Posit... | Pfizer Says Pivotal Avelumab Study Shows Posit... | English | en | US | Unknown | http://www.quotenet.com/news/stocks/Pfizer-Say... | quotenet.com | False | Mon Jun 06 13:42:52 GMT 2016 | United States |
| 2016-06-06 13:39:29 | WEB_115_sg_57557d1174 | Unknown | [ 0 ] Whitehouse Laboratories Returns To MD&M ... | Whitehouse Laboratories Returns To MD&M East A... | English | en | US | Unknown | http://www.bio-medicine.org/medicine-news-1/Wh... | bio-medicine.org | False | Mon Jun 06 13:39:29 GMT 2016 | United States |
| 2016-06-06 13:37:39 | WEB_115_sg_57557ca37 | Unknown | Orbis Research: United States Antibacterial Dr... | Orbis Research: United States Antibacterial Dr... | English | en | US | Unknown | http://www.medgadget.com/2016/06/orbis-researc... | medgadget.com | False | Mon Jun 06 13:37:39 GMT 2016 | United States |


**FIXME:** This notebook is left intentionally unfinished.  Come back and keep importing examples.