### A notebook dedicated to working with federal recall data and related sets

1. [Source](https://opendata.socrata.com/Government/Most-common-reasons-for-food-recalls/9iuc-3wkn)
1. [JSON file](https://opendata.socrata.com/api/views/9iuc-3wkn/rows.json?accessType=DOWNLOAD)

In [4]:
# %pylab inline pulls numpy and matplotlib into the interactive namespace
%pylab inline 
import os
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix  # formerly pandas.tools.plotting

Populating the interactive namespace from numpy and matplotlib


In [5]:
path = r'C:\Users\danielle.leong\Documents\data'  # all data stored here
filename = 'fed_recall_data.json'
fpath = os.path.join(path, filename)
csv_path = os.path.join(path, 'Most_common_reasons_for_food_recalls.csv')

In [6]:
df1 = pd.read_csv(csv_path)
df1

Unnamed: 0,REASON,COMPANY
0,,1
1,Undeclared drug substance: tadalafil,1
2,Undeclared drug substance: Desmethyl Sibutramine,1
3,Potential contamination with Listeria monocyto...,1
4,Air leakage,1
5,Undeclared drug substance: Sulfoaildenafil,1
6,Contains undeclared drug ingredient,3
7,may not meet package expiration date,1
8,Potential for botulism,3
9,McNeil Consumer Healthcare is initiating this ...,1


### In order to process data:
- Create levels/groups
  - use the first word of each reason as the grouping key
- Compile group counts

Note: Requires [hierarchical indexing][1]/[reshaping][2]

[1]: http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-hierarchical
[2]: http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-stacking
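
The plan above can be sketched on a small stand-in frame. This is only an illustration: `toy`, its rows, and the variable names are made up here, and summing `COMPANY` is one assumed reading of "group counts".

```python
import pandas as pd

# Toy stand-in for df1; these rows are illustrative, not the real data.
toy = pd.DataFrame({
    "REASON": [
        "Undeclared drug substance: tadalafil",
        "Undeclared milk",
        "Potential for botulism",
        None,
    ],
    "COMPANY": [1, 2, 3, 1],
})

# Step 1: the first word of each reason becomes the group label.
# Rows with a missing reason get a missing label.
toy["CAT"] = toy["REASON"].str.split().str[0]

# Step 2: compile group counts by summing COMPANY within each group.
# groupby drops rows whose key is missing by default.
group_counts = toy.groupby("CAT")["COMPANY"].sum()

# Grouping on two keys returns a hierarchically indexed Series.
by_cat_reason = toy.groupby(["CAT", "REASON"])["COMPANY"].sum()

print(group_counts)
print(by_cat_reason)
```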

In [24]:
strings = df1['REASON']
glist = []  # first word of every non-missing reason
for i in strings:
    if isinstance(i, str):  # skip NaN, which is a float
        l1 = i.split()
        s1 = l1[0]
        glist.append(s1)

Alternatively, the above code can be altered to obtain only the unique values:
```python
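strings = df1['REASON']
glist = []  # unique first words, in order of first appearance
for i in strings:
    if isinstance(i, str):  # skip NaN, which is a float
        l1 = i.split()
        s1 = l1[0]
        if s1 not in glist:  # keep only the first occurrence
            glist.append(s1)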
```            
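
A shorter way to get the same order-preserving unique list, as a sketch. The `reasons` Series is a made-up stand-in for `df1['REASON']`. `dict.fromkeys` keeps the first occurrence of each key, so it avoids the repeated `not in` list scans.

```python
import pandas as pd

# Illustrative stand-in for df1['REASON'].
reasons = pd.Series(
    ["Undeclared milk", None, "Potential for botulism", "Undeclared soy"]
)

# First word of every string entry; missing values and blank strings are skipped.
first_words = [r.split()[0] for r in reasons if isinstance(r, str) and r.split()]

# dict keys are unique and keep insertion order.
unique_words = list(dict.fromkeys(first_words))
print(unique_words)
```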

In [30]:
glist.column()

AttributeError: 'Series' object has no attribute 'column'

In [38]:
# print(glist)
glist = pd.Series(glist)
df2 = pd.concat([df1, glist], axis = 1)
df2.columns = ['REASON', 'COMPANY', 'CAT']
print(df2.columns)

Index(['REASON', 'COMPANY', 'CAT'], dtype='object')
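
One thing to watch with this `concat`: the loop that builds `glist` skips missing reasons, so the list is shorter than the frame and every label after a skipped row shifts up. The sketch below reproduces that on a made-up three-row frame and shows an index-aligned alternative. The names `toy`, `shifted`, and `aligned` are illustrative.

```python
import pandas as pd

# Illustrative frame whose first REASON is missing.
toy = pd.DataFrame({
    "REASON": [None, "Undeclared milk", "Potential for botulism"],
    "COMPANY": [1, 2, 3],
})

# Loop-style list: the missing row is skipped, leaving only 2 labels.
skipped = [r.split()[0] for r in toy["REASON"] if isinstance(r, str)]

# concat lines the labels up by position 0, 1, ... so they land one row early.
shifted = pd.concat([toy, pd.Series(skipped, name="CAT")], axis=1)

# Index-aligned alternative: one label per row, missing where REASON is missing.
aligned = toy.assign(CAT=toy["REASON"].str.split().str[0])

print(shifted)
print(aligned)
```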


In [51]:
tuples = list(zip(df2['CAT'], df2['REASON']))
index = pd.MultiIndex.from_tuples(tuples, names = ['first', 'second'])
df_mi = pd.Series(df2['COMPANY'], index = index)
df_mi

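# This only partly works: the MultiIndex is created, but the values come out as NaN,
# because pd.Series reindexes df2['COMPANY'] against the new index instead of assigning by position.
# Index levels follow the order of the tuple elements, and `names` labels them; no extra function is needed.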

first       second
Undeclared  NaN                                                    NaN
            Undeclared drug substance: tadalafil                   NaN
Potential   Undeclared drug substance: Desmethyl Sibutramine       NaN
Air         Potential contamination with Listeria monocytogenes
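
The NaN values come from how `pd.Series` treats an existing Series: given a new index, it looks up the new labels in the old index rather than assigning by position. The sketch below shows that behavior next to two ways of keeping the values. The frame `toy` is made up for illustration.

```python
import pandas as pd

# Illustrative frame with the same three columns as df2.
toy = pd.DataFrame({
    "CAT": ["Undeclared", "Undeclared", "Potential"],
    "REASON": ["Undeclared milk", "Undeclared soy", "Potential for botulism"],
    "COMPANY": [1, 2, 3],
})

tuples = list(zip(toy["CAT"], toy["REASON"]))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

# Series + new index: labels are looked up in the old index (0, 1, 2),
# none match, so every value is NaN.
reindexed = pd.Series(toy["COMPANY"], index=index)

# Raw values + new index: values are assigned by position.
positional = pd.Series(toy["COMPANY"].to_numpy(), index=index)

# Same result without building the index by hand.
via_set_index = toy.set_index(["CAT", "REASON"])["COMPANY"]

print(reindexed)
print(positional)
```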

In [40]:
df2['CAT'].value_counts()

Undeclared          28
Potential            6
may                  5
Listeria             4
May                  4
Salmonella           3
undeclared           3
Glass                2
Air                  2
Contains             2
Adverse              2
Possible             2
E.                   2
Bacillus             2
Allergen             2
glass                1
McNeil               1
Sterility            1
salmonella           1
Off                  1
Clostridium          1
Breakage             1
Sulfites             1
Lead                 1
Inaccurate           1
low                  1
Insects              1
Particulate          1
Insect               1
potential            1
                    ..
Presence             1
Improper             1
Uneviscerated        1
Uncharacteristic     1
Meat                 1
Visible              1
Metal                1
cGMPs                1
Subpotent            1
the                  1
Difficulty           1
allergen             1
Under-proce
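
The counts above split on case (`Undeclared` vs `undeclared`, `May` vs `may`, `Glass` vs `glass`). A minimal sketch of folding those together before counting, on a made-up Series standing in for `df2['CAT']`:

```python
import pandas as pd

# Illustrative labels; the real column is df2['CAT'].
cats = pd.Series(["Undeclared", "undeclared", "May", "may", "Glass", "glass", None])

# Lower-case the labels so case variants fall into one bucket.
# value_counts ignores missing values by default.
counts = cats.str.lower().value_counts()
print(counts)
```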

### A quick note on object types  
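A missing `REASON` is stored as `nan`, which is a float. The snippet below shows why the check has to compare `type()` with the class instead of comparing the object itself: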
```python
print(strings[0]) 
>>> nan
print(type(strings[0]))
>>> <class 'float'>
print(strings[0] is float)
>>> False
print(np.nan is float)
>>> False
print(strings[0] == np.nan)
>>> False
print(type(strings[0]) is float) # be sure to compare type() not object itself to class
>>> True
```  
The same comparison on a string:
```python
x = 'blah'
print(type(x) is not float)
>>> True
print(type(x) == float)
>>> False
```
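
For missing values specifically, a dedicated check is usually less error-prone than a type comparison. A short sketch using `math.isnan` and `pd.isna`; the `reasons` Series is a made-up stand-in for `df1['REASON']`.

```python
import math

import numpy as np
import pandas as pd

value = np.nan

print(value == np.nan)           # False: NaN never compares equal, even to itself
print(isinstance(value, float))  # True: NaN is an ordinary float
print(math.isnan(value))         # True, but math.isnan only accepts numbers
print(pd.isna(value))            # True
print(pd.isna("blah"))           # False: pd.isna also accepts non-numeric objects

# Vectorized form: keep only the rows that have a reason.
reasons = pd.Series([np.nan, "Air leakage"])
mask = reasons.notna()
present = reasons[mask]
print(present)
```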

### Quick notes on JSON files in pandas
***
**Previous JSON code; the cause of the errors is still unclear:**  
``` python
import json
import requests

json_web = r'https://opendata.socrata.com/api/views/9iuc-3wkn/rows.json?accessType=DOWNLOAD'
df_test = json.loads(requests.get(json_web).text)

list(df_test.keys())

df = pd.read_json(fpath)
# throws an error, see
# http://stackoverflow.com/questions/33559660/error-while-reading-json-file

# attempted fix using meta for the column labels and data for the rows; didn't work either
pd.DataFrame(df_test["data"], columns = [x["label"] for x in df_test["meta"]])
```
***
Another snippet for a quick look at the metadata
``` python
df_test["meta"]
```
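
One possible reason the `meta`/`data` attempt failed: if `df_test["meta"]` is a dict rather than a list of fields, iterating over it yields key strings, and `x["label"]` then raises a `TypeError`. The sketch below assumes the column definitions sit under `meta -> view -> columns` with a `name` key. That layout is an assumption about the Socrata export and should be checked against `df_test["meta"]`. The `sample` dict is hand-written so the example runs without a network call.

```python
import pandas as pd

# Hand-written stand-in for the parsed response (df_test).
# The nesting and key names below are assumptions, not confirmed by the source.
sample = {
    "meta": {"view": {"columns": [
        {"name": "sid", "fieldName": ":sid"},
        {"name": "REASON", "fieldName": "reason"},
        {"name": "COMPANY", "fieldName": "company"},
    ]}},
    "data": [
        ["row-1", "Air leakage", "1"],
        ["row-2", "Potential for botulism", "3"],
    ],
}

# Column labels come from the nested metadata, one per position in each data row.
columns = [col["name"] for col in sample["meta"]["view"]["columns"]]
df = pd.DataFrame(sample["data"], columns=columns)

# Values may arrive as strings, so convert counts explicitly.
df["COMPANY"] = pd.to_numeric(df["COMPANY"])
print(df)
```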
***
Example of this JSON import working on a feed that has the expected structure
``` python
# example from Stack Overflow
d = json.loads(requests.get('https://data.gov.in/node/305681/datastore/export/json').text)

print(list(d.keys()))

pd.DataFrame(d["data"], columns=[x["label"] for x in d["fields"]])
```

