## Reporting: wrangle_report
* Create a **300-600 word written report** called "wrangle_report.pdf" or "wrangle_report.html" that briefly describes your wrangling efforts. This is to be framed as an internal document.

## Gathering data

The first step in the data wrangling process is data gathering. The data of interest are located and formatted in three different ways:

1. The first dataset, `twitter_archive_enhanced.csv`, was readily available and was imported using pandas' *read_csv* function.


2. The second dataset, `image_predictions.tsv`, holds the output of a deep-learning model that predicts dog breeds from images. It is stored as a tab-separated values file hosted on a cloud server. The file was downloaded and saved in the current folder using Python's *with open* statement, which closes the file reliably after writing. It was then read with pandas' *read_csv* function, passing *\t* as the separator to handle the tab-separated values, and stored in a pandas dataframe named `image_predictions`.


3. The third dataset was meant to be collected from Twitter, focusing on the *WeRateDogs* account. [WeRateDogs](https://en.wikipedia.org/wiki/WeRateDogs) is a Twitter account that rates people's dogs and adds a humorous comment about each one. Collecting this data required [Tweepy](https://www.tweepy.org/) as well as Twitter authentication credentials. However, my application for a developer account was denied, so I had to settle for the provided `tweet-json.txt` file. The required columns were extracted into a pandas dataframe named `new_tweets`.
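
The gathering steps for the second and third datasets can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the sample bytes stand in for a real download, and the column and field names (`jpg_url`, `p1`, `id`, `retweet_count`, `favorite_count`) are assumptions about the file contents.

```python
import io
import json
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the bytes returned by a download
# (for example the body of an HTTP response); not the real file.
sample_tsv = b"tweet_id\tjpg_url\tp1\n1\thttps://example.com/a.jpg\tpug\n"

workdir = tempfile.mkdtemp()
tsv_path = os.path.join(workdir, "image_predictions.tsv")

# Save the bytes inside a "with open" block so the file is closed reliably.
with open(tsv_path, "wb") as file:
    file.write(sample_tsv)

# Read the tab-separated file by passing sep="\t" to read_csv.
image_predictions = pd.read_csv(tsv_path, sep="\t")

# Hypothetical stand-in for tweet-json.txt, assumed to hold one JSON object per line.
sample_json_lines = (
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)

# Keep only the fields of interest from each tweet.
rows = []
for line in io.StringIO(sample_json_lines):
    tweet = json.loads(line)
    rows.append(
        {
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        }
    )

new_tweets = pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])
```

With the real files, the in-memory samples would be replaced by the downloaded content and by an open handle on `tweet-json.txt`.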

## Assessing data

- The visual assessment was done by opening the files in Microsoft Excel. The `image_predictions` and `new_tweets` dataframes first had to be exported as csv files using pandas' *to_csv* function and downloaded to local disk.

- The programmatic assessment was done using pandas functions and attributes including *head*, *sample*, *describe*, *info*, *isnull*, *sum*, *duplicated*, *unique*, *sort_values* and *shape*.
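
As a rough illustration of the programmatic assessment, the sketch below applies the listed pandas calls to a tiny, made-up dataframe. The `archive` variable and its contents are hypothetical and only mimic the shape of the real table.

```python
import pandas as pd

# Hypothetical miniature of the archive table, for illustration only.
archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 3],
        "name": ["Charlie", "a", "None", "None"],
        "expanded_urls": [
            "https://example.com/1",
            None,
            "https://example.com/3",
            "https://example.com/3",
        ],
    }
)

# Quick looks at the data.
print(archive.head())
print(archive.sample(2, random_state=0))
archive.info()
print(archive.describe())

# Counts that point at quality issues.
missing_per_column = archive.isnull().sum()   # missing values per column
duplicate_rows = archive.duplicated().sum()   # fully duplicated rows
unique_names = archive["name"].unique()       # distinct values in one column
sorted_by_id = archive.sort_values("tweet_id", ascending=False)
n_rows, n_cols = archive.shape
```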

From these assessments, the following issues, divided into Quality and Tidiness, were discovered:

#### Quality issues
1. twitter-archive-enhanced table: Missing records in in_reply_to_status_id column

2. twitter-archive-enhanced table: Missing records in in_reply_to_user_id column

3. twitter-archive-enhanced table: Missing records in retweeted_status_id column

4. twitter-archive-enhanced table: Missing records in retweeted_status_user_id column

5. twitter-archive-enhanced table: Missing records in retweeted_status_timestamp column

6. twitter-archive-enhanced table: Missing records in expanded_urls column

7. twitter-archive-enhanced table: Erroneous datatypes in two columns (timestamp and retweeted_status_timestamp)

8. twitter-archive-enhanced table: The dog name column has incorrect values: "None", "such", "a", "an", "the", "this", "all", "by", "his", etc.
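
Checks along the following lines can surface the issues listed above. This is a sketch on made-up data: the `archive` dataframe is hypothetical, and treating lowercase words and the string "None" as invalid names is an assumption based on the examples in issue 8.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same kinds of problems as the real table.
archive = pd.DataFrame(
    {
        "in_reply_to_status_id": [np.nan, np.nan, np.nan, 8.0],
        "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
        "name": ["Phineas", "a", "None", "the"],
    }
)

# Fraction of missing values per column.
missing_share = archive.isnull().mean()

# The timestamp column is text, not a datetime type.
timestamp_is_datetime = pd.api.types.is_datetime64_any_dtype(archive["timestamp"])

# Names that are all lowercase or the placeholder "None" are likely not real names.
suspicious_name = archive["name"].str.islower() | archive["name"].eq("None")
suspicious_values = archive.loc[suspicious_name, "name"].tolist()
```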

#### Tidiness issues
1. twitter-archive-enhanced table: Four columns (doggo, floofer, pupper and puppo) instead of one column (type) in archived table

2. twitter-archive-enhanced table: tweet_id column in archived table duplicated in new_tweets and predictions tables


## Cleaning data

- Most of the data cleaning effort addressed missing data, erroneous data types and incomplete data. Most of the columns with missing data were dropped because more than 92% of their values are missing. The only exception was the *expanded_urls* column, but it was eventually dropped as well because, in my opinion, it would not contribute to the analysis or its insights.

- The _timestamp_ column was converted from string to datetime for easier manipulation and to remove the trailing +0000.

- Regular expressions and string extraction were used to recover dog names that were missing from the *name* column by searching the *text* column. However, a complete fix would have required going through over 2000 rows carefully to work out which preceding or following words in each sentence identify the name. As this was going to take days, I had to stop at some point.

- Concerning the tidiness issues, the columns (doggo, floofer, pupper and puppo) were melted into one column named _type_ to make the dataset tidy and easy to analyze. Lastly, the three dataframes were merged on the *tweet_id* column because it is the only column common to all of them.
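
The cleaning steps above can be sketched as follows. This is a simplified illustration on made-up data, not the notebook's actual code: the sample rows, the `named ...` regular expression, the use of the string "None" as a placeholder in the stage columns, and the rule of keeping one stage per tweet are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the three tables.
archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "timestamp": [
            "2017-08-01 16:23:56 +0000",
            "2017-08-01 00:17:27 +0000",
            "2017-07-31 00:18:03 +0000",
        ],
        "text": [
            "This is Phineas. 13/10",
            "Here we have a pupper named Tilly. 12/10",
            "Such a good doggo 11/10",
        ],
        "name": ["Phineas", "a", "None"],
        "in_reply_to_status_id": [np.nan, np.nan, np.nan],
        "doggo": ["None", "None", "doggo"],
        "floofer": ["None", "None", "None"],
        "pupper": ["None", "pupper", "None"],
        "puppo": ["None", "None", "None"],
    }
)
new_tweets = pd.DataFrame(
    {"tweet_id": [1, 2, 3], "retweet_count": [10, 3, 5], "favorite_count": [25, 7, 9]}
)
image_predictions = pd.DataFrame({"tweet_id": [1, 2], "p1": ["pug", "beagle"]})

clean = archive.copy()

# 1. Drop columns where more than 92% of the values are missing.
missing_share = clean.isnull().mean()
clean = clean.drop(columns=missing_share[missing_share > 0.92].index)

# 2. Convert timestamp to datetime and drop the +0000 offset.
clean["timestamp"] = pd.to_datetime(clean["timestamp"], utc=True).dt.tz_localize(None)

# 3. Re-extract names for rows whose name is lowercase or "None".
#    The pattern is one illustrative rule, not a complete solution.
bad_name = clean["name"].str.islower() | clean["name"].eq("None")
extracted = clean["text"].str.extract(r"named ([A-Z][a-z]+)", expand=False)
clean.loc[bad_name, "name"] = extracted[bad_name]

# 4. Melt the four stage columns into a single "type" column.
stage_columns = ["doggo", "floofer", "pupper", "puppo"]
id_columns = [col for col in clean.columns if col not in stage_columns]
melted = clean.melt(
    id_vars=id_columns, value_vars=stage_columns,
    var_name="stage_column", value_name="type",
).drop(columns="stage_column")

# Keep one row per tweet, preferring a real stage over the "None" placeholder.
# A tweet with several stages would keep only one of them here.
melted["has_type"] = melted["type"].ne("None")
melted = (
    melted.sort_values("has_type", ascending=False)
    .drop_duplicates("tweet_id")
    .drop(columns="has_type")
    .sort_values("tweet_id")
    .reset_index(drop=True)
)

# 5. Merge the three tables on their only common column.
master = melted.merge(new_tweets, on="tweet_id").merge(image_predictions, on="tweet_id")
```

Because `merge` defaults to an inner join, tweets without an image prediction drop out of `master`; passing `how="left"` would keep them instead.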
