# Wrangling

## Gathering

I collected data from three sources:
 1. CSV that was handed to me
 2. Downloaded TSV file from online source
 3. JSON data from Twitter's API

### 1. Downloading and Loading the CSV File into a Dataframe
The CSV file was downloaded from Udacity and stored on my local machine in the same folder as my Jupyter Notebook. The file was then loaded into a dataframe using pandas, as described in the cells below.

### 2. Downloading the TSV File from the Internet
The next file had to be downloaded programmatically from the internet using Python. A URL was provided where I could download the file. Using the `os` and `requests` libraries, I created a folder on my machine and made a request to the URL to download the file.

After downloading the file, I named it based on the URL. Then, I loaded the data into a new dataframe using pandas.
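The download step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual cell: the function names `filename_from_url` and `download_file` are hypothetical, and the URL in the usage note is a placeholder.

```python
import os
from urllib.parse import urlparse

import requests


def filename_from_url(url):
    """Return the last path segment of the URL, used as the local file name."""
    return os.path.basename(urlparse(url).path)


def download_file(url, folder):
    """Download url into folder and return the path of the saved file."""
    # Create the target folder if it does not exist yet.
    os.makedirs(folder, exist_ok=True)

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of saving an error page

    path = os.path.join(folder, filename_from_url(url))
    with open(path, "wb") as out_file:
        out_file.write(response.content)
    return path
```

A call such as `download_file("https://example.com/data/file.tsv", "downloads")` would save the file as `downloads/file.tsv`, which could then be loaded with `pd.read_csv(path, sep="\t")`.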

### 3. Accessing JSON data from Twitter's API
The next data set had to be accessed through Twitter's API. After following the instructions in the class, I set up my own Twitter developer access and got my own authentication keys and tokens. I then followed the instructions on how to access the Twitter API and used the Tweepy API class to get the data.

First, I needed to get the Tweet IDs from the Archive file (first file) that I loaded. After getting the Tweet IDs stored in a variable, I ran some simple tests to understand how the data was being returned using a single Tweet ID. I made sure I understood how to use the Tweepy API, get the JSON data, and download the JSON data for one Tweet ID before doing the entire list of tweet IDs. The rest of the steps I took are listed next to the cells below.

#### 3a. Downloading all of the JSON data into a text file
I stored all of the Tweet IDs from the archive dataframe in a list, which I would use when sending requests to the Twitter API. I created a simple script to loop through all of the Tweet IDs and store the JSON data for each tweet on its own line in a text file.

The script took a very long time to run (around 30 minutes). Because it stores the text file on my local machine, I didn't need to run it again, so the script is commented out in the Jupyter Notebook.
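A loop of this kind might look like the sketch below. To keep it independent of any API credentials, the tweet lookup is passed in as `fetch_tweet`, a hypothetical callable that returns a dict for one Tweet ID. With Tweepy, it would presumably wrap something like `api.get_status(tweet_id, tweet_mode="extended")._json`, but that is an assumption about the notebook rather than a copy of it.

```python
import json


def save_tweets_as_json_lines(tweet_ids, fetch_tweet, out_path):
    """Write one JSON object per line and return the IDs that failed."""
    failed_ids = []
    with open(out_path, "w", encoding="utf-8") as out_file:
        for tweet_id in tweet_ids:
            try:
                tweet_json = fetch_tweet(tweet_id)  # expected to return a dict
            except Exception as error:  # e.g. a deleted tweet
                failed_ids.append(tweet_id)
                print(f"Skipping {tweet_id}: {error}")
                continue
            # One tweet per line, so the file can be read back line by line.
            out_file.write(json.dumps(tweet_json) + "\n")
    return failed_ids
```

Returning the failed IDs makes it easy to see which tweets from the archive could not be retrieved without re-running the whole loop.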

#### 3b. Reading the text file line by line and loading the data into a dataframe
This section took the longest time in the wrangling process. It took multiple iterations to extract the data correctly, and I had to inspect the JSON data in the text file for various tweets to make sure I was getting the right values.

The script below loads the text file, reads each line, and stores the attributes I wanted in a dataframe. The minimum required attributes were Tweet ID, Retweet Count, and Favorite Count. I had to look through the text file to get the actual attribute names in the JSON data, using one Tweet ID as an example.

I also decided to pull Full Text, URL, and User Mentions just to see if I could. This proved to be more difficult than I thought. The Full Text was easy to pull, as it sat at the same level as the minimum required fields. However, the URL and User Mentions were both nested in the JSON data tree, so I had to go several layers deep to access them.

I noticed that User Mentions were empty for a lot of data (more than 2,000) so I decided to remove that from the script. 

URL was nested in the same location for the majority of tweets. However, some tweets had their URLs nested in different parts of the JSON data, which raised exceptions. That is why I added the try-except blocks to the first part of the dataframe script.
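The line-by-line read with a fallback for the nested URL could be sketched as below. The key names (`id`, `retweet_count`, `favorite_count`, `full_text`, and the `entities` paths) are assumptions based on the standard Twitter v1.1 tweet object, since the exact paths used in the notebook are not listed here. The helper names are hypothetical.

```python
import json


def extract_url(tweet):
    """Return the first expanded URL found, or None if neither location has one."""
    entities = tweet.get("entities", {})
    try:
        return entities["urls"][0]["expanded_url"]
    except (KeyError, IndexError):
        pass  # fall through to the second assumed location
    try:
        return entities["media"][0]["expanded_url"]
    except (KeyError, IndexError):
        return None


def load_tweet_rows(path):
    """Read a JSON-lines file and return (rows, ids_without_url)."""
    rows, missing_url_ids = [], []
    with open(path, encoding="utf-8") as tweet_file:
        for line in tweet_file:
            if not line.strip():
                continue  # skip blank lines
            tweet = json.loads(line)
            url = extract_url(tweet)
            if url is None:
                missing_url_ids.append(tweet["id"])
            rows.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
                "full_text": tweet.get("full_text"),
                "url": url,
            })
    return rows, missing_url_ids
```

The returned `rows` list can be passed directly to `pandas.DataFrame(rows)`, and `missing_url_ids` records which tweets had no URL in either assumed location.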

#### 3c. Creating a new Dataframe Ignoring Exceptions
Next, I wanted to create a new dataframe now that I had collected all of the exceptions. I wanted to get the list of all the tweet IDs that raised an exception, then run a modified version of the script from step 3b that would avoid any exceptions.

Getting the URLs for each of the exceptions proved to be challenging as the URL was nested in various places for these tweets. I didn't have enough time to go through each of the exceptions or write the code to handle all scenarios. So, if an exception existed, I just ignored the URL.

## Assessing

I assessed the data in two ways: visually and programmatically. I visually inspected each of the dataframes first and tried to spot tidiness or quality issues. After that, I started doing some programmatic assessments: I checked that all of the data types were correct and looked for duplicates.
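As a rough illustration of the programmatic checks, the sketch below inspects data types and counts duplicates on a tiny made-up dataframe. The rows are invented for the example and are not taken from the archive.

```python
import pandas as pd

# Tiny made-up frame standing in for one of the tables.
df = pd.DataFrame({
    "tweet_id": [1, 2, 2],
    "timestamp": ["2017-01-01 00:00:00", "2017-01-02 00:00:00", "2017-01-02 00:00:00"],
    "name": ["Rex", "a", "a"],
})

# Data type of every column, to spot columns stored with the wrong type.
print(df.dtypes)

# Number of fully duplicated rows and of repeated tweet IDs.
duplicate_rows = int(df.duplicated().sum())
duplicate_ids = int(df["tweet_id"].duplicated().sum())
print(duplicate_rows, duplicate_ids)
```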

After assessing the data, I collected and grouped all of the issues into two sections: Tidiness Issues and Quality Issues. I made sure to focus my cleaning efforts on the tidiness issues first.

## Cleaning

I tried to make sure that all of the tidiness issues were resolved first. To do so, I merged some dataframes and split out some columns that held multiple attributes.

I also made sure to follow the Define, Code, and Test practices when approaching all of these issues, as you will see in my Jupyter Notebook. 

Below are the items that were cleaned.

## Quality Issues

#### df_archive table
1. Dog names don't seem to be correct (visually noticed 'the' and 'a' as some dog names)
2. Timestamp and retweeted_status_timestamp are represented as objects and should be converted to datetime.
3. Expanded URLs seem to consist of duplicate Twitter URLs and also URLs to other sites (like GoFundMe), all jumbled into a single attribute.
4. None values are being counted as valid entries when they should be NaN.
5. A small portion of rating denominators are not 10 and are inconsistent.
6. Retweets seem like duplicate data in both the archive and image_pred tables.
7. IDs should be integers instead of floats.
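A few of these fixes (issues 1, 2, and 4) might be coded along the lines below. This is a sketch on made-up rows. Treating every all-lowercase name as invalid is an assumed heuristic for catching values like 'a' and 'the', not necessarily the rule used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56", "2017-08-01 00:17:27"],
    "name": ["Phineas", "a"],
    "doggo": ["None", "doggo"],
})

# Issue 2: parse the timestamp strings into datetimes.
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Issue 4: turn the literal string "None" into a real missing value.
df["doggo"] = df["doggo"].replace("None", np.nan)

# Issue 1 (assumed heuristic): blank out all-lowercase "names" such as "a".
df["name"] = df["name"].mask(df["name"].str.islower())
```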

#### df_image_pred table
8. Inconsistent punctuation of the dog types in the p1, p2, and p3 columns.

## Tidiness Issues

#### df_archive table
1. Dog type is spread across multiple columns (i.e. doggo, floofer, pupper, puppo) when it could be combined into one; the same applies to the rating (numerator and denominator).
2. URLs are embedded in the text attribute.
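For the first tidiness issue, one way to collapse the four dog type columns into a single column is sketched below. It assumes that missing stages are stored as the string "None", and the new column name `dog_stage` is hypothetical. The rows are made up.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows; "None" marks a stage that does not apply.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})


def combine_stages(row):
    """Join every stage present in the row, or return None if there is none."""
    stages = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(stages) if stages else None


df["dog_stage"] = df[stage_cols].apply(combine_stages, axis=1)
df = df.drop(columns=stage_cols)  # the four original columns are now redundant
```

Joining with a comma keeps tweets that mention more than one stage, instead of silently picking one of them.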

#### df_tweepy table
3. Only has two attributes that are new (retweet count and favorite count). This should be merged into the df_archive table.
4. Multiple attributes on the table are very similar, with the prediction confidence score getting worse with each additional prediction (i.e. prediction, prediction number, and dog (true or false)).
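The merge described in item 3 could look like the sketch below. The shared key `tweet_id` and the choice of a left join (keeping every archive row even when no API data came back) are assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up stand-ins for the two tables.
df_archive = pd.DataFrame({"tweet_id": [1, 2, 3], "text": ["a", "b", "c"]})
df_tweepy = pd.DataFrame({
    "tweet_id": [1, 2],
    "retweet_count": [10, 20],
    "favorite_count": [100, 200],
})

# Left join: archive rows without API data get NaN in the new columns.
df_merged = df_archive.merge(df_tweepy, on="tweet_id", how="left")
```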