# Wrangling and Analyzing Data (replace with your own report title)
## Gather

#### Import the required libraries

In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from IPython.display import display
pd.set_option('display.max_rows',None)
pd.set_option('display.max_columns',None)
pd.set_option('max_colwidth',100)

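> Hint: the files you need to gather are:
1. Gather the file on hand, `twitter_archive_enhanced.csv`. It contains the core tweet information and is the main dataset for this wrangling effort. Its rating, stage, and name values were extracted from the original `text`, but the extraction is unreliable: not all ratings are correct, and some dog names and stages are wrong as well. **If you want to use ratings, stages, and names in your analysis and visualizations, you need to assess and clean these columns. Doing so will teach you more practical skills.**
2. Programmatically download the file hosted online, `image-predictions.tsv`. It contains the image predictions for the tweets, that is, the dog breed predicted from the picture in each tweet.
3. Query the API to gather additional tweet information into `tweet_json.txt`. If you cannot access Twitter, you can read the `tweet_json.txt` file provided for download with the project and extract the data you need from it. At a minimum, extract the retweet count (`retweet_count`) and the favorite count (`favorite_count`). **If your analysis does not use other columns, there is no need to gather them.** Extracting other columns only to clean them serves no purpose.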
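For the third item, a minimal sketch of pulling only the required fields out of a line-delimited JSON file. It assumes each line of `tweet_json.txt` is one JSON object with `id`, `retweet_count` and `favorite_count` keys; the helper name `extract_counts` and the two sample records are made up for illustration.

```python
import io
import json

import pandas as pd


def extract_counts(lines):
    """Collect the tweet id, retweet_count and favorite_count from JSON lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


# Two made-up records standing in for the real file
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 10, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 7, "favorite_count": 20, "lang": "en"}\n'
)
df_counts = extract_counts(sample)
print(df_counts)
```

With the real file, the same helper could be called as `extract_counts(open('tweet_json.txt', encoding='utf-8'))`. Any other keys in each record are simply ignored.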

##### Load twitter_archive_enhanced into a DataFrame

In [36]:
df_raw = pd.read_csv('twitter-archive-enhanced.txt',sep=',',header=0,index_col=0)

##### Load image-predictions.tsv into a DataFrame

In [37]:
df_img = pd.read_csv('image-predictions.tsv',sep='\t',header=0,index_col=0)

##### Load tweet_json.txt into a DataFrame
+ Error encountered: `ValueError: Trailing data`
+ Fix: pass the parameter `lines=True`
+ Reference: https://stackoverflow.com/questions/30088006/loading-a-file-with-more-than-one-line-of-json-into-pandas


In [38]:
df_supl = pd.read_json('tweet_json.txt',lines=True)
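The `Trailing data` error noted above can be reproduced on a tiny in-memory example. This is a sketch with made-up records, not the real file: without `lines=True`, pandas tries to parse the whole input as a single JSON document, and with `lines=True` it reads one record per line.

```python
import io

import pandas as pd

# Made-up stand-in for tweet_json.txt: one JSON object per line
raw = '{"id": 1, "retweet_count": 5}\n{"id": 2, "retweet_count": 7}\n'

try:
    pd.read_json(io.StringIO(raw))  # parsed as one JSON document
    failed_without_lines = False
except ValueError as err:
    failed_without_lines = True
    print("without lines=True:", err)

# One record per line
df_demo = pd.read_json(io.StringIO(raw), lines=True)
print(df_demo)
```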

## Assess

### Assessing twitter-archive-enhanced.csv

In [39]:
df_raw.sample(5)

tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
816014286006976512,,,2017-01-02 20:12:21 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Larry. He has no self control. Tongue still nifty af tho 11/10 https://t....,7.320056e+17,4196984000.0,2016-05-16 00:31:53 +0000,"https://twitter.com/dog_rates/status/732005617171337216/photo/1,https://twitter.com/dog_rates/st...",11,10,Larry,,,,
817171292965273600,,,2017-01-06 00:49:53 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Tebow. He kindly requests that you put down the coffee and play with him. 13/10 such a g...,,,,https://twitter.com/dog_rates/status/817171292965273600/photo/1,13,10,Tebow,,,,
817777686764523521,,,2017-01-07 16:59:28 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Dido. She's playing the lead role in ""Pupper Stops to Catch Snow Before Resuming Shadow ...",,,,https://twitter.com/dog_rates/status/817777686764523521/video/1,13,10,Dido,doggo,,pupper,
754747087846248448,,,2016-07-17 18:38:22 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Keith. He's pursuing a more 2D lifestyle. Idiosyncratic af. 12/10 follow your dreams Kei...,,,,https://twitter.com/dog_rates/status/754747087846248448/photo/1,12,10,Keith,,,,
680115823365742593,,,2015-12-24 20:00:22 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Ozzy. He woke up 2 minutes before he had to be ready for the Christmas party. 9/10 class...,,,,https://twitter.com/dog_rates/status/680115823365742593/photo/1,9,10,Ozzy,,,,


In [26]:
df_raw.shape

(2356, 16)

In [34]:
df_raw.columns

Index(['in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp', 'source',
       'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

#### 16 features and 2356 records in total.

+ 'in_reply_to_status_id': Nullable. If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s ID.
+ 'in_reply_to_user_id': Nullable. If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s author ID. This will not necessarily always be the user directly mentioned in the Tweet.
+ 'timestamp'
+ 'source': Utility used to post the Tweet, as an HTML-formatted string.
+ 'text': The actual UTF-8 text of the status update, i.e. the text content of the tweet.
+ 'retweeted_status_id'
+ 'retweeted_status_user_id'
+ 'retweeted_status_timestamp'
+ 'expanded_urls'
+ 'rating_numerator': the numerator of the rating; typically 10 or higher, though values below 10 also occur.
+ 'rating_denominator': the denominator of the rating; usually 10.
+ 'name': the dog's name, as inferred from the tweet text.
+ 'doggo': one of the four dog stages.
+ 'floofer': one of the four dog stages.
+ 'pupper': one of the four dog stages.
+ 'puppo': one of the four dog stages.

References:
+ https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object
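Because the rating columns are derived from `text`, one way to check them is to extract the rating again and compare. The sketch below uses made-up tweet texts and a hypothetical pattern that also accepts a decimal numerator. Whether the real data contains decimal ratings is an assumption to verify.

```python
import pandas as pd

# Made-up tweet texts; the second has a decimal numerator
texts = pd.Series([
    "This is Larry. Tongue still nifty af tho 11/10",
    "Very good boy, deserves 9.75/10",
    "No rating in this one",
])

# Numerator may be a decimal, denominator is an integer
pattern = r"(?P<numerator>\d+(?:\.\d+)?)/(?P<denominator>\d+)"

# str.extract returns the first match per row, NaN where nothing matches
ratings = texts.str.extract(pattern).astype(float)
print(ratings)
```

Only the first `number/number` occurrence in each text is captured, so a text with a date or a fraction before the rating would still need manual review.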

In [29]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2356 entries, 892420643555336193 to 666020888022790149
Data columns (total 16 columns):
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(2), object(10)
memory usa

In [45]:
df_raw[['rating_numerator','rating_denominator']].describe()

Unnamed: 0,rating_numerator,rating_denominator
count,2356.0,2356.0
mean,13.126486,10.455433
std,45.876648,6.745237
min,0.0,0.0
25%,10.0,10.0
50%,11.0,10.0
75%,12.0,10.0
max,1776.0,170.0


In [43]:
df_raw.isnull().any()

in_reply_to_status_id          True
in_reply_to_user_id            True
timestamp                     False
source                        False
text                          False
retweeted_status_id            True
retweeted_status_user_id       True
retweeted_status_timestamp     True
expanded_urls                  True
rating_numerator              False
rating_denominator            False
name                          False
doggo                         False
floofer                       False
pupper                        False
puppo                         False
dtype: bool

In [47]:
df_raw.duplicated().sum()

0

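#### Summary of the df_raw assessment:
+ The index is not a consecutive sequence (it holds the `tweet_id` values).
+ 'timestamp' and 'retweeted_status_timestamp' are stored as object rather than as a datetime type.
+ 'rating_numerator' and 'rating_denominator' have maximum values of 1776 and 170, far above their means of about 13.1 and 10.5.
+ 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp' and 'expanded_urls' contain nulls.
+ The dog stage is spread across four columns.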
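Two of the points above, the object-typed timestamp and the stage spread over four columns, could be addressed roughly as follows. This is a sketch on made-up rows. It assumes a missing stage is stored as the string `"None"`, which is consistent with the four stage columns showing 2356 non-null values but should be checked against the real data. The `stage` column name is hypothetical.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows; "None" as the missing-stage placeholder is an assumption
df_demo = pd.DataFrame({
    "timestamp": [
        "2017-01-02 20:12:21 +0000",
        "2017-01-07 16:59:28 +0000",
        "2015-12-24 20:00:22 +0000",
    ],
    "doggo": ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})

# object -> timezone-aware datetime
df_demo["timestamp"] = pd.to_datetime(df_demo["timestamp"], utc=True)

# Join whichever stage names are present; rows with no stage become None
df_demo["stage"] = df_demo[stage_cols].apply(
    lambda row: ",".join(v for v in row if v != "None") or None, axis=1
)
df_demo = df_demo.drop(columns=stage_cols)
print(df_demo)
```

Rows that carry two stages (such as `doggo` and `pupper` together) end up as a combined value, which makes them easy to find and review.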

### Assessing image-predictions.tsv

In [49]:
df_img.sample(5)

tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
793120401413079041,https://pbs.twimg.com/media/CwG6zDfWcAA8jBD.jpg,1,Labrador_retriever,0.724944,True,golden_retriever,0.169744,True,kuvasz,0.035502,True
667915453470232577,https://pbs.twimg.com/media/CUTpj-GWcAATc6A.jpg,1,leatherback_turtle,0.452517,False,boxer,0.196655,True,terrapin,0.160983,False
672082170312290304,https://pbs.twimg.com/media/CVO3KodXAAAj1de.jpg,1,hamster,0.13244,False,toy_poodle,0.123962,True,bubble,0.056212,False
672609152938721280,https://pbs.twimg.com/media/CVWWdKLWEAEnSk7.jpg,1,microwave,0.981946,False,rotisserie,0.007472,False,television,0.005881,False
759047813560868866,https://pbs.twimg.com/media/Coit84_VYAEMtLi.jpg,1,Labrador_retriever,0.778546,True,bathing_cap,0.154254,False,golden_retriever,0.024972,True


In [27]:
df_img.shape

(2075, 11)

In [50]:
df_img.columns

Index(['jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf',
       'p2_dog', 'p3', 'p3_conf', 'p3_dog'],
      dtype='object')

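#### 11 features and 2075 records in total.

+ jpg_url: link to the image the prediction was made on
+ img_num: the number of the image that produced the most confident prediction → 1 (the first image in the tweet)
+ p1: the algorithm's #1 prediction for the image in the tweet → golden retriever
+ p1_conf: how confident the algorithm is in its #1 prediction → 95%
+ p1_dog: whether the #1 prediction is a breed of dog (it may be another species, such as a bear or a horse) → True
+ p2: the algorithm's second most likely prediction for the image → Labrador retriever
+ p2_conf: how confident the algorithm is in its #2 prediction → 1%
+ p2_dog: whether the #2 prediction is a breed of dog → True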
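Given these columns, one possible way to derive a single breed per tweet is to take the first of the three predictions that is flagged as a dog. The sketch below uses three rows copied from the `df_img.sample(5)` output above. It assumes `p1` to `p3` are ordered by decreasing confidence, and the `breed` column and `first_dog` helper are hypothetical names.

```python
import pandas as pd

# Three rows taken from the sample output above
df_demo = pd.DataFrame({
    "p1": ["Labrador_retriever", "leatherback_turtle", "microwave"],
    "p1_dog": [True, False, False],
    "p2": ["golden_retriever", "boxer", "rotisserie"],
    "p2_dog": [True, True, False],
    "p3": ["kuvasz", "terrapin", "television"],
    "p3_dog": [True, False, False],
})


def first_dog(row):
    """Return the highest-ranked prediction that is a dog, or None."""
    for i in (1, 2, 3):
        if row[f"p{i}_dog"]:
            return row[f"p{i}"]
    return None


df_demo["breed"] = df_demo.apply(first_dog, axis=1)
print(df_demo[["p1", "p2", "p3", "breed"]])
```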



In [30]:
df_img.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2075 entries, 666020888022790149 to 892420643555336193
Data columns (total 11 columns):
jpg_url    2075 non-null object
img_num    2075 non-null int64
p1         2075 non-null object
p1_conf    2075 non-null float64
p1_dog     2075 non-null bool
p2         2075 non-null object
p2_conf    2075 non-null float64
p2_dog     2075 non-null bool
p3         2075 non-null object
p3_conf    2075 non-null float64
p3_dog     2075 non-null bool
dtypes: bool(3), float64(3), int64(1), object(4)
memory usage: 152.0+ KB


In [51]:
df_img.describe()

Unnamed: 0,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0
mean,1.203855,0.594548,0.1345886,0.06032417
std,0.561875,0.271174,0.1006657,0.05090593
min,1.0,0.044333,1.0113e-08,1.74017e-10
25%,1.0,0.364412,0.05388625,0.0162224
50%,1.0,0.58823,0.118181,0.0494438
75%,1.0,0.843855,0.1955655,0.09180755
max,4.0,1.0,0.488014,0.273419


In [52]:
df_img.isna().any()

jpg_url    False
img_num    False
p1         False
p1_conf    False
p1_dog     False
p2         False
p2_conf    False
p2_dog     False
p3         False
p3_conf    False
p3_dog     False
dtype: bool

In [54]:
df_img.duplicated().sum()

66

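#### Summary of the df_img assessment:
+ The index is not a consecutive sequence (it holds the `tweet_id` values).
+ There are 66 duplicate records.
+ The maximum image number (`img_num`) is 4.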
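Since `tweet_id` is the index here, `DataFrame.duplicated()` compares only the column values, so the duplicates it reports are rows whose image data is identical under different tweet ids. A small sketch on made-up rows shows how to tell the cases apart and keep the first occurrence. The example URLs are placeholders.

```python
import pandas as pd

# Made-up rows: two tweet ids pointing at the same image
df_demo = pd.DataFrame(
    {
        "jpg_url": [
            "https://example.com/a.jpg",
            "https://example.com/a.jpg",
            "https://example.com/b.jpg",
        ],
        "p1": ["pug", "pug", "boxer"],
    },
    index=pd.Index([101, 102, 103], name="tweet_id"),
)

n_row_dups = int(df_demo.duplicated().sum())            # columns only, index ignored
n_url_dups = int(df_demo["jpg_url"].duplicated().sum())  # same image URL
n_id_dups = int(df_demo.index.duplicated().sum())        # same tweet id
print(n_row_dups, n_url_dups, n_id_dups)

# Keep the first row for each image URL
df_dedup = df_demo[~df_demo["jpg_url"].duplicated(keep="first")]
```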


### Assessing tweet_json.txt

In [55]:
df_supl.sample(5)

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,is_quote_status,lang,place,possibly_sensitive,possibly_sensitive_appealable,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,truncated,user
652,,,2016-10-27 23:30:09,"[0, 111]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'dog_rates', 'name': 'SpookyWe...",,0,False,RT @dog_rates: I'm not sure what this dog is doing but it's pretty inspirational. 12/10 https://...,,791784077045166082,791784077045166080,,,,,,False,en,,0.0,0.0,,,,11208,False,"{'created_at': 'Wed Dec 30 06:37:25 +0000 2015', 'id': 682088079302213632, 'id_str': '6820880793...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rate..."
631,,,2016-11-02 00:42:53,"[0, 134]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'dog_rates', 'name': 'SpookyWe...","{'media': [{'id': 791672307924209664, 'id_str': '791672307924209664', 'indices': [111, 134], 'me...",0,False,RT @dog_rates: When she says you're a good boy and you know you're a good boy because you're a g...,,793614319594401792,793614319594401792,,,,,,False,en,,0.0,0.0,,,,3646,False,"{'created_at': 'Thu Oct 27 16:06:04 +0000 2016', 'id': 791672322847637504, 'id_str': '7916723228...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rate..."
1710,,,2015-12-25 17:30:01,"[0, 118]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 68044029097769...","{'media': [{'id': 680440290977693696, 'id_str': '680440290977693696', 'indices': [95, 118], 'med...",1578,False,Merry Christmas. My gift to you is this tiny unicorn running into a wall in slow motion. 11/10 h...,,680440374763077632,680440374763077632,,,,,,False,en,,0.0,0.0,,,,559,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rate..."
841,,,2016-08-18 23:55:18,"[0, 109]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 76642325224799...","{'media': [{'id': 766423252247994368, 'id_str': '766423252247994368', 'indices': [110, 133], 'me...",6651,False,This is Shadoe. Her tongue flies out of her mouth at random. Can't have a serious conversation w...,,766423258543644672,766423258543644672,,,,,,False,en,,0.0,0.0,,,,1810,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rate..."
793,,,2016-09-06 16:10:20,"[0, 99]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 77319160137613...","{'media': [{'id': 773191601376137218, 'id_str': '773191601376137218', 'indices': [100, 123], 'me...",11079,False,This is Grey. He's the dogtor in charge of your checkpup today. 12/10 I'd never miss an appointm...,,773191612633579521,773191612633579520,,,,,,False,en,,0.0,0.0,,,,4713,False,,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,"{'id': 4196983835, 'id_str': '4196983835', 'name': 'SpookyWeRateDogs™', 'screen_name': 'dog_rate..."


In [28]:
df_supl.shape

(2352, 31)

In [56]:
df_supl.columns

Index(['contributors', 'coordinates', 'created_at', 'display_text_range',
       'entities', 'extended_entities', 'favorite_count', 'favorited',
       'full_text', 'geo', 'id', 'id_str', 'in_reply_to_screen_name',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status',
       'lang', 'place', 'possibly_sensitive', 'possibly_sensitive_appealable',
       'quoted_status', 'quoted_status_id', 'quoted_status_id_str',
       'retweet_count', 'retweeted', 'retweeted_status', 'source', 'truncated',
       'user'],
      dtype='object')

In [31]:
df_supl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2352 entries, 0 to 2351
Data columns (total 31 columns):
contributors                     0 non-null float64
coordinates                      0 non-null float64
created_at                       2352 non-null datetime64[ns]
display_text_range               2352 non-null object
entities                         2352 non-null object
extended_entities                2073 non-null object
favorite_count                   2352 non-null int64
favorited                        2352 non-null bool
full_text                        2352 non-null object
geo                              0 non-null float64
id                               2352 non-null int64
id_str                           2352 non-null int64
in_reply_to_screen_name          78 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null float64
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 n

In [32]:
df_supl[['favorite_count','retweet_count']].describe()

Unnamed: 0,favorite_count,retweet_count
count,2352.0,2352.0
mean,8109.19898,3134.932398
std,11980.795669,5237.846296
min,0.0,0.0
25%,1417.0,618.0
50%,3596.5,1456.5
75%,10118.0,3628.75
max,132318.0,79116.0
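The archive has 2356 rows while the API data has 2352, so a few tweets have no engagement counts. A sketch of attaching the two count columns to the archive with a left join, using made-up frames that mirror the layout above (archive indexed by `tweet_id`, API data with an `id` column):

```python
import pandas as pd

# Made-up stand-ins for df_raw and df_supl
archive = pd.DataFrame(
    {"rating_numerator": [11, 13, 12]},
    index=pd.Index([10, 20, 30], name="tweet_id"),
)
api = pd.DataFrame({
    "id": [10, 20, 40],
    "retweet_count": [5, 7, 9],
    "favorite_count": [50, 70, 90],
    "lang": ["en", "en", "en"],
})

# Keep only the needed columns and align the key with the archive index
counts = (
    api[["id", "retweet_count", "favorite_count"]]
    .rename(columns={"id": "tweet_id"})
    .set_index("tweet_id")
)

# Left join keeps every archive row; tweets missing from the API data get NaN
merged = archive.join(counts, how="left")
print(merged)
```

A left join is used so that the missing tweets stay visible as NaN rows. An inner join would drop them silently.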



### Programmatic assessment

In [7]:
# Assess the three datasets with pandas methods such as info, value_counts, etc.

In [8]:
# You need to add more code cells and markdown cells to complete the programmatic assessment
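As an illustration of what such checks might look like, the sketch below runs `value_counts` on the name column and flags suspicious entries. The rows are made up, and treating lowercase words and the string `"None"` as invalid names is an assumption to confirm against the real data.

```python
import pandas as pd

# Made-up rows for illustration only
df_demo = pd.DataFrame({
    "name": ["Larry", "a", "None", "Tebow", "the", "a"],
    "rating_denominator": [10, 10, 10, 10, 170, 0],
})

# Frequency of each extracted name
name_counts = df_demo["name"].value_counts()

# Names that are all lowercase or the literal "None" are unlikely to be real names
suspect_names = df_demo.loc[
    df_demo["name"].str.islower() | (df_demo["name"] == "None"), "name"
]

# Ratings whose denominator is not the usual 10
odd_denominators = df_demo[df_demo["rating_denominator"] != 10]

print(name_counts)
print(suspect_names.unique())
print(odd_denominators)
```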

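> Hint:
- After the visual and programmatic assessments, summarize every issue you found in the three datasets;
- Describe each issue in one or a few sentences;
- The final list must contain at least 8 quality issues and 2 tidiness issues.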

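#### Quality
##### `dataframe_name1` table (change `dataframe_name1` to your dataset name, then delete the hint in parentheses)
- The number of issues may vary; add more list items on new lines, or delete any unused `- `
- Issue description
- Issue description
- Our own answers
- 
- 
- 
- 

##### `dataframe_name2` table (change `dataframe_name2` to your dataset name, then delete the hint in parentheses)
- Issue description
- Issue description
- 
- 


##### `dataframe_name3` table (change `dataframe_name3` to your dataset name, then delete the hint in parentheses)
- Issue description
- Issue description
- 
- 


#### Tidiness
- Issue description
- Issue description
- 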



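## Clean

> Hint:
- Back up the datasets before cleaning them;
- Clean each issue you listed by following the structure shown below: **Define, Code, Test**.

In [9]:
# Back up the three datasets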
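A minimal sketch of the backup step followed by one Define, Code, Test cycle. The frame is made up, and removing retweets is only a hypothetical example issue chosen for illustration.

```python
import pandas as pd

# Made-up stand-in for the archive; NaN in retweeted_status_id means an original tweet
df_demo = pd.DataFrame({
    "text": [
        "This is Tebow. 13/10",
        "RT @dog_rates: This is Larry. 11/10",
        "This is Dido. 13/10",
    ],
    "retweeted_status_id": [None, 7.320056e17, None],
})

# Back up first so the raw frame stays untouched
df_clean = df_demo.copy()

# Define: keep only original tweets (rows without a retweeted_status_id)

# Code
df_clean = df_clean[df_clean["retweeted_status_id"].isna()].copy()

# Test
assert df_clean["retweeted_status_id"].notna().sum() == 0
assert len(df_demo) == 3  # the source frame is unchanged
```

Working on a copy means a cleaning step that goes wrong can be rerun from the original frame without reloading the files.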


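#### Issue one

##### Define
Briefly describe the solution to the issue

##### Code

In [10]:
# Code that resolves issue one


##### Test

In [11]:
# Test that issue one has been cleaned correctly


#### Issue two

##### Define
Briefly describe the solution to the issue

##### Code

In [12]:
# Code that resolves issue two


##### Test

In [13]:
# Test that issue two has been cleaned correctly


In [14]:
# You need to add more code cells and markdown cells to complete all of the cleaning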


### Store the cleaned master dataset

In [15]:
# Save the cleaned master dataset as a csv file or a SQLite database file
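Both storage options can be sketched in a few lines. The frame is made up, the file names and the `master` table name are hypothetical, and a temporary directory is used so the example does not touch the project folder.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Made-up stand-in for the cleaned master dataset
df_master = pd.DataFrame({"tweet_id": [1, 2], "rating_numerator": [11, 13]})

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "twitter_archive_master.csv")  # hypothetical name
db_path = os.path.join(out_dir, "twitter_archive_master.db")    # hypothetical name

# Option 1: CSV file
df_master.to_csv(csv_path, index=False)

# Option 2: SQLite database file
conn = sqlite3.connect(db_path)
try:
    df_master.to_sql("master", conn, if_exists="replace", index=False)
    df_back = pd.read_sql("SELECT * FROM master", conn)  # read back to verify
finally:
    conn.close()

print(df_back)
```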


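## Analysis and visualization

> Hint:
- Analyze the master dataset in the Notebook using pandas or SQL, and produce at least three (3) separate insights.
- Produce at least one (1) visualization, either with a Python plotting library in the Notebook or in Tableau.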

In [16]:
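# Analysis or visualization code


Insights or notes on the analysis or visualization above

In [17]:
# Analysis or visualization code


Insights or notes on the analysis or visualization above

In [18]:
# You need to add more code cells and markdown cells to complete all of the analysis and visualization


Further notes, summary, etc.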

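> Hint: after completing everything in the Notebook, you also need to produce two PDF reports made up of text and images. Because these reports contain only text and images, with no code, you can write them in a word processor such as Word:
- Write a 300 to 600 word report named `wrangle_report.pdf` that briefly describes your data wrangling process. Treat it as an internal document for your team members to read and discuss.
- Write a report of at least 250 words named `act_report.pdf` in which you share your insights with readers and show the visualizations you produced from the wrangled data. Treat it as an external document, such as a blog post or magazine article.

> Hint: before submitting the project, it is recommended that you delete all hint text and template comments from the Notebook, keeping only your own Markdown text and code comments.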