# Wrangling Data 2

This notebook covers wrangling data in different formats, handling missing data, and working with text.

## Part 1: Text wrangling and regular expressions
In this part we will work with the citation file exported from the [Nature Review Article](https://www.nature.com/articles/s41586-020-2649-2) *Array Programming with NumPy*. Below we read the file into the Python variable `cite` and print the result for you to preview.

In [1]:
# Run but do not modify this code
with open("numpy_nature.txt") as f:
    cite = f.read()

print(cite)

TY  - JOUR
AU  - Harris, Charles R.
AU  - Millman, K. Jarrod
AU  - van der Walt, Stéfan J.
AU  - Gommers, Ralf
AU  - Virtanen, Pauli
AU  - Cournapeau, David
AU  - Wieser, Eric
AU  - Taylor, Julian
AU  - Berg, Sebastian
AU  - Smith, Nathaniel J.
AU  - Kern, Robert
AU  - Picus, Matti
AU  - Hoyer, Stephan
AU  - van Kerkwijk, Marten H.
AU  - Brett, Matthew
AU  - Haldane, Allan
AU  - del Río, Jaime Fernández
AU  - Wiebe, Mark
AU  - Peterson, Pearu
AU  - Gérard-Marchant, Pierre
AU  - Sheppard, Kevin
AU  - Reddy, Tyler
AU  - Weckesser, Warren
AU  - Abbasi, Hameer
AU  - Gohlke, Christoph
AU  - Oliphant, Travis E.
PY  - 2020
DA  - 2020/09/01
TI  - Array programming with NumPy
JO  - Nature
SP  - 357
EP  - 362
VL  - 585
IS  - 7825
AB  - Array programming provides a powerful, compact and expressive syntax for accessing, manipulating and operating on data in vectors, matrices and higher-dimensional arrays. NumPy is the primary array programming library for the Python language. It has an essential r

### Question 1
There are several authors, each recorded on a separate line beginning with `AU`. Create a Python list of all of the author names formatted as in the file but without the extra characters and whitespace (i.e., without the `AU  - ` or the newline `\n` characters). Your list should be of the form `['Harris, Charles R.', 'Millman, K. Jarrod', ..., 'Oliphant, Travis E.']`. When you are finished, print the resulting list. 

In [3]:
entryList = cite.split("\n")
authorList = [entry[6:] for entry in entryList if entry[:2] == "AU"]
print(authorList)

['Harris, Charles R.', 'Millman, K. Jarrod', 'van der Walt, Stéfan J.', 'Gommers, Ralf', 'Virtanen, Pauli', 'Cournapeau, David', 'Wieser, Eric', 'Taylor, Julian', 'Berg, Sebastian', 'Smith, Nathaniel J.', 'Kern, Robert', 'Picus, Matti', 'Hoyer, Stephan', 'van Kerkwijk, Marten H.', 'Brett, Matthew', 'Haldane, Allan', 'del Río, Jaime Fernández', 'Wiebe, Mark', 'Peterson, Pearu', 'Gérard-Marchant, Pierre', 'Sheppard, Kevin', 'Reddy, Tyler', 'Weckesser, Warren', 'Abbasi, Hameer', 'Gohlke, Christoph', 'Oliphant, Travis E.']
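The same extraction can also be written with a regular expression, which previews the tools used in Question 3. The following is a minimal sketch on a short, made-up `sample` string laid out like the citation file (the names `sample`, `author_pattern`, and `sample_authors` are ours, not part of the assignment).

```python
import re

# Hypothetical four-line sample in the same layout as the citation file.
sample = "TY  - JOUR\nAU  - Harris, Charles R.\nAU  - Gommers, Ralf\nPY  - 2020"

# With re.MULTILINE, ^ and $ match at the start and end of every line.
# The group captures everything after the "AU  - " prefix on that line.
author_pattern = re.compile(r"^AU  - (.+)$", re.MULTILINE)
sample_authors = author_pattern.findall(sample)
print(sample_authors)
```

Because `.` does not match a newline, each captured name stops at the end of its own line, so no separate step is needed to strip `\n`.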


### Question 2
Create a Pandas DataFrame that contains three columns: one for first names, one for middle names, and one for last names for all of the authors. You should use descriptive column names, but you can use the default primary index (the row labels) of 0, 1, 2, etc. Thus, the first few rows of your dataframe might look like the table pictured below. You are welcome to use the results of the prior question to answer this problem.

|      | first      | middle     | last         |
| ---- | ---------- | ---------- | ------------ |
| 0    | Charles    | R.         | Harris       |
| 1    | K.         | Jarrod     | Millman      |
| 2    | Stéfan     | J.         | van der Walt |
| 3    | Ralf       |            | Gommers      |
| 4    | Pauli      |            | Virtanen     |

Note that some authors do not have a middle name, in which case you can leave the middle name column blank. When you are finished, display the first 10 rows of the resulting dataframe by calling `head(10)` on it.

In [4]:
import pandas as pd

In [5]:
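authorSeries = pd.Series(authorList)
# Split "Last, First Middle" at the comma: column 0 is the last name, column 1 the rest
lastNameSplit = authorSeries.str.split(pat=",", expand=True)
# Split the rest on whitespace: column 0 is the first name, column 1 the middle name (if any)
middleNameSplit = lastNameSplit.iloc[:, 1].str.split(expand=True)
# Authors without a middle name get a blank entry
middleNameSplit[1] = middleNameSplit[1].fillna(" ")
entries = {"first": middleNameSplit.iloc[:, 0],
           "middle": middleNameSplit.iloc[:, 1],
           "last": lastNameSplit.iloc[:, 0]}
names = pd.DataFrame(data=entries)
names.head(10)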

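|      | first     | middle | last         |
| ---- | --------- | ------ | ------------ |
| 0    | Charles   | R.     | Harris       |
| 1    | K.        | Jarrod | Millman      |
| 2    | Stéfan    | J.     | van der Walt |
| 3    | Ralf      |        | Gommers      |
| 4    | Pauli     |        | Virtanen     |
| 5    | David     |        | Cournapeau   |
| 6    | Eric      |        | Wieser       |
| 7    | Julian    |        | Taylor       |
| 8    | Sebastian |        | Berg         |
| 9    | Nathaniel | J.     | Smith        |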
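As an alternative to splitting twice, the three name parts can be captured in one pass with `Series.str.extract` and named groups. This is only a sketch on three sample names, and it assumes every entry has the form `Last, First [Middle]`; the pattern and the variable names are ours.

```python
import pandas as pd

# Three sample entries in the "Last, First Middle" layout used by the file.
sample_authors = pd.Series(
    ["Harris, Charles R.", "Gommers, Ralf", "van der Walt, Stéfan J."]
)

# last:   everything before the comma
# first:  the first whitespace-delimited token after the comma
# middle: whatever remains, if anything (the group is optional)
name_pattern = r"^(?P<last>[^,]+),\s*(?P<first>\S+)(?:\s+(?P<middle>.+))?$"

sample_names = sample_authors.str.extract(name_pattern)
# Reorder the columns and show a blank instead of NaN for absent middle names.
sample_names = sample_names[["first", "middle", "last"]].fillna("")
print(sample_names)
```

Named groups become the column names of the result, so no separate dictionary of columns is needed.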


### Question 3
Below we extract the abstract from the citation and store it in a string variable `abstract`. Write regular expressions to answer the following questions about the abstract.

1. Print the starting index of everywhere `NumPy` appears in the abstract (i.e., the index of the `N` wherever `NumPy` occurs in the `abstract` string).
2. Print all of the capitalized words in `abstract`, including words with extra capitalized letters like `NumPy`.
3. Print all of the words that immediately follow `NumPy`, but do not include the word `NumPy` itself. Note that in one occurrence it is hyphenated `NumPy-like`, in which case your code can return `-like` or `like` as you prefer.

In [6]:
import re
abstract_query = re.compile(r"AB  - (.+)")
abstract = re.search(abstract_query, cite).group(1)
print(abstract)

Array programming provides a powerful, compact and expressive syntax for accessing, manipulating and operating on data in vectors, matrices and higher-dimensional arrays. NumPy is the primary array programming library for the Python language. It has an essential role in research analysis pipelines in fields as diverse as physics, chemistry, astronomy, geoscience, biology, psychology, materials science, engineering, finance and economics. For example, in astronomy, NumPy was an important part of the software stack used in the discovery of gravitational waves1 and in the first imaging of a black hole2. Here we review how a few fundamental array concepts lead to a simple and powerful programming paradigm for organizing, exploring and analysing scientific data. NumPy is the foundation upon which the scientific Python ecosystem is constructed. It is so pervasive that several projects, targeting audiences with specialized needs, have developed their own NumPy-like interfaces and array object

In [7]:
# 1 Print the starting index of everywhere NumPy appears in the abstract (i.e., the index of the N wherever NumPy occurs in the abstract string).
NumPyIdx = [i.start() for i in re.finditer("NumPy", abstract)]
print("Starting index of everywhere NumPy appears in the abstract:")
print(NumPyIdx)
# 2 Print all of the capitalized words in abstract, including words with extra capitalized letters like NumPy.
capitalized_query = re.compile(r"[A-Z][A-Za-z]+")
capitalized = re.findall(capitalized_query, abstract)
print("All the capitalized words in abstract:")
print(capitalized)
# 3 All of the words that immediately follow NumPy
follow_query = re.compile("NumPy ?-?([A-Za-z]+)")
follow = re.findall(follow_query, abstract)
print("All the words that immediately follow NumPy:")
print(follow)

Starting index of everywhere NumPy appears in the abstract:
[171, 469, 768, 962, 1051]
All the capitalized words in abstract:
['Array', 'NumPy', 'Python', 'It', 'For', 'NumPy', 'Here', 'NumPy', 'Python', 'It', 'NumPy', 'Owing', 'NumPy', 'API']
All the words that immediately follow NumPy:
['is', 'was', 'is', 'like', 'increasingly']
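To see how each of the three patterns behaves, it can help to try them on a short string where the answers are easy to check by eye. The sketch below uses a made-up `sample_text`; the `\b` word boundaries and the `[ -]` separator are our choices, not requirements of the question. Note that a character range written `A-z` would also match the punctuation characters that sit between `Z` and `a` in ASCII, which is why the range is spelled `A-Za-z`.

```python
import re

# Made-up sentence containing NumPy three times, once hyphenated.
sample_text = "NumPy is fast. Many NumPy-like APIs exist, and Python uses NumPy widely."

# 1. Start index of every occurrence of "NumPy".
starts = [m.start() for m in re.finditer(r"NumPy", sample_text)]

# 2. Words that begin with a capital letter (later capitals are allowed).
capitalized = re.findall(r"\b[A-Z][A-Za-z]*\b", sample_text)

# 3. The word right after "NumPy", separated by one space or one hyphen.
following = re.findall(r"NumPy[ -]([A-Za-z]+)", sample_text)

print(starts)
print(capitalized)
print(following)
```

`re.finditer` yields match objects, so `m.start()` gives the position of the `N`, while `re.findall` with one capture group returns only the captured word.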


## Part 2: Cleaning up more system logs CSV
In this part we work with a piece of messy tabular data in the form of a poorly formatted CSV file containing data about programs running on computer systems. It contains all of the data about system time and memory from Practice 3, but also includes new information about user ids and machine ids, and some data are missing in every column. (The user ids are made up and do not correspond to any real individuals.)

### Question 4
Below, we import the dataset using the Pandas `read_csv` function that creates a dataframe. Run the code; it will preview the first five rows. 

In [8]:
import pandas as pd
import numpy as np
sys_df = pd.read_csv("more_monitor.csv")
sys_df.head()

|      | System Time second     | System Memory GB | System Memory MB | System User ID | System Machine ID     |
| ---- | ---------------------- | ---------------- | ---------------- | -------------- | --------------------- |
| 0    | ?                      | ?                | ?                | User ID: yw22  | Machine ID: Carrot    |
| 1    | System Time: 40 second | System Mem: 3 Gb | 382 Mb           | ?              | ?                     |
| 2    | ?                      | System Mem: 2 Gb | 271 Mb           | User ID: tp7   | Machine ID: Asparagus |
| 3    | System Time: 31 second | System Mem: 3 Gb | 493 Mb           | ?              | Machine ID: Eggplant  |
| 4    | System Time: 37 second | System Mem: 3 Gb | 411 Mb           | ?              | Machine ID: Asparagus |


There are several formatting issues with the default import. Address the following.

1. The `System User ID` and `System Machine ID` columns contain string data with the redundant prefixes `User ID: ` and `Machine ID: ` in every row that has data. Remove these prefixes so that the columns only contain the user ids and machine ids themselves (for example, the first row should just have `yw22` in the `System User ID` column and `Carrot` in the `System Machine ID` column).
2. The first three columns, `System Time second`, `System Memory GB`, and `System Memory MB`, contain numerical data but are currently formatted as strings, with redundant prefix information repeating the column label and missing data represented as the string `?` instead of the NumPy `NaN` sentinel value. Fix this so that each value in the first three columns is either a single numerical value or `NaN` (note that you should use the actual `np.nan` sentinel value, not just the string with the characters `N`, `a`, and `N`). For example, when you are done, the first three columns of the first row should all have `NaN` values, the second row should be `40`, `3`, and `382`, and so on. Note that the rows at index `400` and on have System Time recorded in minutes instead of seconds; be sure to convert these to seconds by multiplying by 60.
3. Currently the System Memory is split across two columns, one for the GB and one for the MB. For example, the program at index 1 above uses 3 GB and 382 MB of memory. Instead, represent the full system memory in the `System Memory GB` column, and get rid of the `System Memory MB` column. To do so, you need to convert the values in the MB column to GB (1 MB is 0.001 GB) and add that to the GB column, then use the [`drop` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html). Missing values should remain missing after this transformation.

When you are finished, `sys_df` should have the above issues corrected. Run both of the cells with `sys_df.head()` and `sys_df.tail()` to show the first and last few rows of your dataframe.

In [9]:
# 1
user_query = re.compile("User ID: (.+)")
sys_df['System User ID'] = sys_df['System User ID'].apply(lambda x: " ".join(re.findall(user_query, x)))
machine_query = re.compile("Machine ID: (.+)")
sys_df['System Machine ID'] = sys_df['System Machine ID'].apply(lambda x: " ".join(re.findall(machine_query, x)))

In [10]:
# 2
# Pull the digits out of each string; "?" contains no digits, so it becomes NaN
for col in ['System Time second', 'System Memory GB', 'System Memory MB']:
    sys_df[col] = pd.to_numeric(sys_df[col].str.extract(r'([0-9]+)', expand=False))
# Rows from index 400 on record the time in minutes; convert them to seconds
sys_df.loc[400:, 'System Time second'] = sys_df.loc[400:, 'System Time second'] * 60

In [11]:
# 3 
sys_df['System Memory GB'] = sys_df['System Memory GB'] + (sys_df['System Memory MB']*0.001)
sys_df = sys_df.drop(['System Memory MB'], axis=1)

In [12]:
sys_df.head()

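|      | System Time second | System Memory GB | System User ID | System Machine ID |
| ---- | ------------------ | ---------------- | -------------- | ----------------- |
| 0    | NaN                | NaN              | yw22           | Carrot            |
| 1    | 40.0               | 3.382            |                |                   |
| 2    | NaN                | 2.271            | tp7            | Asparagus         |
| 3    | 31.0               | 3.493            |                | Eggplant          |
| 4    | 37.0               | 3.411            |                | Asparagus         |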


In [13]:
sys_df.tail()

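|      | System Time second | System Memory GB | System User ID | System Machine ID |
| ---- | ------------------ | ---------------- | -------------- | ----------------- |
| 995  | 1860.0             | 2.258            | zm3            | Asparagus         |
| 996  | 1860.0             | 3.403            | bk4            |                   |
| 997  | 2160.0             | 3.35             | yw22           | Carrot            |
| 998  | 2640.0             | 3.366            | yw22           | Asparagus         |
| 999  | 1200.0             | 3.49             | bk4            | Asparagus         |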
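The numeric clean-up steps can be checked end to end on a tiny dataframe where the expected values are easy to work out by hand. The sketch below builds a made-up three-row `toy` frame in the same layout; treating its last row as recorded in minutes is an assumption made purely for illustration.

```python
import pandas as pd

# Made-up rows in the same layout as the first three columns of the file.
toy = pd.DataFrame({
    "System Time second": ["?", "System Time: 40 second", "System Time: 2 minute"],
    "System Memory GB": ["?", "System Mem: 3 Gb", "System Mem: 2 Gb"],
    "System Memory MB": ["?", "382 Mb", "271 Mb"],
})

# Keep only the first run of digits in each string. "?" has no digits,
# so str.extract returns a missing value that to_numeric turns into NaN.
for col in toy.columns:
    digits = toy[col].str.extract(r"(\d+)", expand=False)
    toy[col] = pd.to_numeric(digits)

# Assumption for this toy example: rows from index 2 on are in minutes.
toy.loc[2:, "System Time second"] *= 60

# Fold the MB column into the GB column (1 MB = 0.001 GB), then drop it.
# NaN + NaN stays NaN, so missing values remain missing.
toy["System Memory GB"] = toy["System Memory GB"] + toy["System Memory MB"] * 0.001
toy = toy.drop(columns=["System Memory MB"])
print(toy)
```

For the middle row this gives 3 + 382 × 0.001 = 3.382 GB, which matches the value shown for index 1 in the real data.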


### Question 5
The `sys_df` dataframe from question 4 should now be a little easier to read and use. Answer the following questions about `sys_df`.

1. How many rows are missing data in the `System Machine ID` column?
2. What is the average value of `System Memory GB` among the rows that are missing data in the `System User ID` column? 
3. How many rows are missing data in both the `System Time second` and `System Memory GB` columns?

Note: It is not necessary to complete all of question 4 in order to answer some of these questions, and we will also look at your code for partial credit.

In [14]:
# Put your code to answer the question here
# 1
print("Rows of missing data in the System Machine ID column:")
print(sys_df[sys_df["System Machine ID"]==""].shape[0])
# 2
missing = sys_df[sys_df["System User ID"]==""]
print("Average value of System Memory GB among the rows that are missing data in the System User ID column:")
print(missing["System Memory GB"].mean())
# 3
print("Rows of missing data in both the System Time second and System Memory GB columns:")
print(sys_df[(sys_df["System Time second"].isnull()) & (sys_df["System Memory GB"].isnull())].shape[0])

Rows of missing data in the System Machine ID column:
196
Average value of System Memory GB among the rows that are missing data in the System User ID column:
(value not shown; it depends on the rows with an empty `System User ID`)
Rows of missing data in both the System Time second and System Memory GB columns:
69
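The three counts above all follow the same pattern: build a boolean mask that marks the missing rows, then sum it or use it to filter. Below is a minimal sketch on a made-up four-row frame. It assumes, as in the cleaned `sys_df`, that missing ids are stored as empty strings, and it converts them to `NaN` first so that `isna` works uniformly on every column.

```python
import numpy as np
import pandas as pd

# Made-up rows: NaN marks missing numbers, "" marks missing ids.
toy = pd.DataFrame({
    "System Time second": [np.nan, 40.0, np.nan, 31.0],
    "System Memory GB": [np.nan, 3.382, 2.271, 3.493],
    "System User ID": ["yw22", "", "tp7", ""],
})

# Replace empty strings with NaN so every column uses the same sentinel.
toy["System User ID"] = toy["System User ID"].mask(toy["System User ID"] == "")

# Summing a boolean mask counts the True entries.
n_missing_user = toy["System User ID"].isna().sum()

# mean() skips NaN by default, so only recorded memory values are averaged.
mean_mem = toy.loc[toy["System User ID"].isna(), "System Memory GB"].mean()

# & combines the masks row by row: True only where both columns are missing.
n_both = (toy["System Time second"].isna() & toy["System Memory GB"].isna()).sum()

print(n_missing_user, mean_mem, n_both)
```

In this toy frame two rows lack a user id, their memory values average to (3.382 + 3.493) / 2, and one row is missing both time and memory.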


## Part 3: Wrangling FDA JSON Dataset 
In this part we work with a messy JSON dataset containing information about several drugs labels.

### Question 6
Below we import the `FDADrugLabel.json` file into the `labels` variable. This is the same dataset as the practice. The resulting Python object is somewhat messy; we encourage you to explore the data before answering the questions.

In [15]:
import json
with open("FDADrugLabel.json") as f:
    labels = json.load(f)

Answer the following questions.

1. Print the average number of key/value (or name/value) pairs for the drugs.
2. Print the list of all of the `manufacturer_names` without any other information. `manufacturer_names` are not a top level key/name; you will need to search for where they are located and how to extract them.
3. Print how many drugs contain the string `child` anywhere in their `warnings` (including as part of larger strings like `children`). `warnings` is a top level key/name, but some drugs are missing this data and the information is contained in strings within lists of length 1. 

In [16]:
# Put your code to answer the question here
# 1
lenList = [len(i) for i in labels]
print("Average number of key/value pairs: " + str(sum(lenList)/len(lenList)))
# 2 
manufacturer_names = [labels[i]['openfda']['manufacturer_name'] for i in range(len(labels))]
print("List of all manufacturer names:")
print(manufacturer_names)
# 3 
child_query = re.compile(r"child")

Average number of key/value pairs: 19.25
List of all manufacturer names:
['Nature and Health Beauty Co., Ltd.', 'Silver Star Brands', 'Johnson & Johnson Consumer Inc.', 'Proficient Rx LP', 'Energique, Inc.', 'Energique, Inc.', 'Amerisource Bergen', 'Seroyal USA', 'Proficient Rx LP', 'MODECOS CO., LTD.', 'United Exchange Corp.', 'King Bio Inc.', 'Aurobindo Pharma Limited', 'Energique, Inc.', 'SMART SENSE (Kmart)', 'Mentor Lab', 'BCM Ltd', 'Pearl World INC.', 'Dolgencorp, Inc. (DOLLAR GENERAL & REXALL)', 'Medline Industries, Inc.']
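One way the count for item 3 could be finished is sketched below on a small, made-up list of records. The `toy_labels` data is hypothetical and only mimics the structure described in the question: `warnings`, when present, is a list holding a single string, and some records lack the key entirely.

```python
import re

# Hypothetical records shaped like the description of the dataset.
toy_labels = [
    {"warnings": ["Keep out of reach of children."]},
    {"openfda": {"manufacturer_name": "B Inc."}},  # no "warnings" key
    {"warnings": ["For external use only."]},
    {"warnings": ["Ask a doctor before giving to a child under 2."]},
]

# "child" on its own also matches inside longer words such as "children".
child_pattern = re.compile(r"child")

# dict.get with a default of [] lets records without warnings count as no match.
child_count = sum(
    1
    for label in toy_labels
    if any(child_pattern.search(text) for text in label.get("warnings", []))
)
print(child_count)
```

Using `search` rather than `findall` means each drug is counted at most once, no matter how many times the string appears in its warnings.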
