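# Project description
**Your goal:**
Wrangle the WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is extensive, but it contains only basic tweet information. Additional gathering, assessing, and cleaning are needed to produce "Wow!"-worthy analyses and visualizations.

**Your tasks in this project:**
* Data wrangling, which consists of:
    * Gathering data
    * Assessing data
    * Cleaning data
* Storing, analyzing, and visualizing the wrangled data
* Reporting on 1) your data wrangling efforts and 2) your data analyses and visualizations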

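# Project objectives

## Udacity objectives

### Gathering data
* The student can gather data from a variety of sources and file formats
    - (DONE) Data is gathered from at least three (3) different sources listed on the "Project Details" page.
    - (DONE) Data is gathered from at least three (3) different file formats listed on the "Project Details" page. Each piece of data is first imported into its own pandas DataFrame.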

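### Assessing data
* The student can assess data quality and tidiness both visually and programmatically.
    - (DONE) Visual assessment: each piece of gathered data is displayed in the Jupyter Notebook for visual assessment. Once displayed, the data can also be assessed in an external application (e.g. Excel, a text editor).
    - (DONE) Programmatic assessment: pandas functions and/or methods are used to assess the data.
* The student can assess a dataset thoroughly
    - (DONE) At least eight (8) quality issues and two (2) tidiness issues are detected, including the issues that must be cleaned to satisfy the project requirements. Each issue is documented in one to a few sentences.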

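### Cleaning data
* The student works through the cleaning by following the steps of the data cleaning process
    - (DONE) The define, code, and test steps of the cleaning process are clearly documented.
* The student can clean a dataset thoroughly with code
    - (DONE) A copy of the original data is saved before cleaning.
    - (DONE) All issues identified in the assessment phase are cleaned successfully with Python and pandas, including the cleaning tasks required by the project.
    - (DONE) A tidy master dataset (or several datasets, if necessary) is created from all pieces of gathered data.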

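### Storing and acting on the wrangled data
* The student can store a gathered, assessed, and cleaned dataset
    - (DONE) The gathered, assessed, and cleaned master dataset is saved to a CSV file or a SQLite database.
* The student can act on the wrangled data to reach conclusions (e.g. through analyses, visualizations, and/or models)
    - (DONE) The master dataset is analyzed with pandas or SQL in the Jupyter Notebook, producing at least three (3) separate insights.
    - (DONE) At least one (1) labeled visualization is produced in the Jupyter Notebook with a Python plotting library, or in Tableau.
    - (DONE) The wrangling work makes clear that the data behind the later analyses and visualizations has been assessed and cleaned.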

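### Report
* The student can reflect on and describe their wrangling efforts
    - (DONE) The wrangling efforts are described concisely. This document (wrangle_report.pdf) needs only about 300-600 words.
* The student can find and describe insights in the wrangled dataset
    - (DONE) At least three (3) insights are reported, with at least one (1) visualization. This document (act_report.pdf) needs at least 250 words.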

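### Project files
* The submitted folder contains all required files
    - (DONE) wrangle_act.ipynb
    - (DONE) wrangle_report.pdf
    - (DONE) act_report.pdf
    - (DONE) All dataset files, including the stored master dataset, using the file names and extensions specified on the project submission page.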

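## Personal hypotheses and planned analyses
* Ratings should be related to retweet counts and favorite counts
* Ratings are related to dog breed
* Ratings are related to dog stage
* Word cloud analysis of the tweet comments
* Box plots of the various ratings
* Overall distribution of the ratings

**The finished tweet-archive-master file should therefore contain the following fields, with each field's source shown in parentheses:**
* tweet_id: tweet ID (df_twitter)
* text: tweet text (df_twitter)
* timestamp: timestamp (df_twitter)
* retweet-count: number of retweets (df_json)
* favorite-count: number of favorites (df_json)
* rating: rating (df_twitter)
* dog-name: dog name (df_twitter)
* dog-status: dog stage (df_twitter)
* dog-types: dog breed (df_pred)
* dog-prediction probability: confidence of the breed prediction (df_pred)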



In [31]:
# Import the required modules
import pandas as pd
import numpy as np
import requests
import tweepy
import json
import os
import seaborn as sns
from IPython.display import display

# Gathering data
## Downloading data

In [2]:
# Download image_predictions.tsv
url1 = 'https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/image-predictions.tsv'
image_f = requests.get(url1)
image_f.raise_for_status()  # fail loudly instead of saving an error page
with open('image_predictions.tsv', 'wb') as file:
    file.write(image_f.content)
image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')
#image_predictions.head(5)

In [3]:
# Download twitter-archive-enhanced.csv
url2 = 'https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/twitter-archive-enhanced.csv'
twitter_f = requests.get(url2)
twitter_f.raise_for_status()  # fail loudly instead of saving an error page
with open('twitter-archive-enhanced.csv', 'wb') as file:
    file.write(twitter_f.content)
twitter_archive_enhanced = pd.read_csv('twitter-archive-enhanced.csv', sep=',')
#twitter_archive_enhanced.head(5)

In [4]:
# A Twitter developer account could not be registered, so this project uses the JSON data that Udacity provides as a text file; download tweet_json.txt
url3 = 'https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/tweet_json.txt'

tweet_f = requests.get(url3)
tweet_f.raise_for_status()  # fail loudly instead of saving an error page
with open('tweet_json.txt', 'wb') as file:
    file.write(tweet_f.content)
#test = open('tweet_json.txt' , 'r')
#a = test.readlines()
#a

tweet_data = pd.DataFrame(columns=['tweet_id','retweet_count','favorite_count'])
with open('tweet_json.txt','r')as file:
    for line in file.readlines():
        dic = json.loads(line)
        #print(dic['id'])
        tweet_id = dic['id']
        retweet_count = dic['retweet_count']
        favorite_count = dic['favorite_count']
        tweet_data = tweet_data.append({'tweet_id' :tweet_id,'retweet_count':retweet_count,'favorite_count':favorite_count},ignore_index=True)
#tweet_data.head(5)
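
The loop above appends one row at a time to a DataFrame. A minimal alternative sketch, assuming the same three fields per JSON line, collects the rows in a plain list and builds the DataFrame once. The in-memory `sample_file` and the name `tweet_data_sketch` are stand-ins introduced only for this illustration; they are not part of the notebook.

```python
import io
import json

import pandas as pd

# Two sample lines shaped like the fields the loop reads from tweet_json.txt.
sample_file = io.StringIO(
    '{"id": 892420643555336193, "retweet_count": 8842, "favorite_count": 39492}\n'
    '{"id": 892177421306343426, "retweet_count": 6480, "favorite_count": 33786}\n'
)

rows = []
for line in sample_file:
    tweet = json.loads(line)
    rows.append({
        'tweet_id': tweet['id'],
        'retweet_count': tweet['retweet_count'],
        'favorite_count': tweet['favorite_count'],
    })

# Building the frame once from a list of dicts keeps the integer dtype.
tweet_data_sketch = pd.DataFrame(
    rows, columns=['tweet_id', 'retweet_count', 'favorite_count']
)
print(tweet_data_sketch.dtypes)
```

Built this way, the IDs and counts stay integers, which may make later type fixes unnecessary.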


In [5]:
# Work on copies of the gathered data
df_pred= image_predictions.copy()
df_twitter = twitter_archive_enhanced.copy()
df_json = tweet_data.copy()

# Assessing data
Each of the three datasets, df_twitter, df_pred, and df_json, is checked in turn.

## Checking the df_twitter dataset

In [6]:
df_twitter

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [7]:
df_twitter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [8]:
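# A dog can have only one stage, but inspection shows that some tweets are tagged with more than one. A programmatic check finds the tweet_id values that carry multiple stage tags: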
temp = df_twitter.drop(['in_reply_to_status_id','in_reply_to_user_id','timestamp','source','text','retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp','expanded_urls','rating_numerator','rating_denominator','name'],axis=1)
#temp
temp_dog = pd.melt(temp, id_vars=["tweet_id"], value_vars=["doggo","floofer","pupper","puppo"])
#temp_dog
temp_dog = temp_dog[temp_dog.value!='None']
temp_dog = temp_dog.drop(['variable'],axis=1)
a = temp_dog.drop_duplicates(subset=['tweet_id'],keep='first')
b = temp_dog.drop_duplicates(subset=['tweet_id'],keep=False)
dupl_dog = a.append(b).drop_duplicates(subset=['tweet_id'],keep=False)
dupl_dog_list = list(dupl_dog.tweet_id)
dupl_dog.tweet_id

191     855851453814013952
200     854010172552949760
460     817777686764523521
531     808106460588765185
565     802265048156610565
575     801115127852503040
705     785639753186217984
733     781308096455073793
778     775898661951791106
822     770093767776997377
889     759793422261743616
956     751583847268179968
1063    741067306818797568
1113    733109485275860992
Name: tweet_id, dtype: int64
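
The same two checks (tweets with more than one stage tag, and tweets with none) can also be expressed as a row-wise count of the stage columns. The following is only a sketch on a toy frame that follows the archive's `'None'` string convention; `toy`, `stage_count`, `multi_stage_ids`, and `no_stage_ids` are names made up for the illustration.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy frame using the same 'None' string convention as the archive.
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo':   ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['pupper', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None'],
})

# Count how many stage columns are filled in on each row.
stage_count = (toy[stage_cols] != 'None').sum(axis=1)

multi_stage_ids = toy.loc[stage_count > 1, 'tweet_id'].tolist()
no_stage_ids = toy.loc[stage_count == 0, 'tweet_id'].tolist()
print(multi_stage_ids, no_stage_ids)
```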

In [9]:
# Tweets with no dog stage
none_list = temp['tweet_id'][(temp.doggo == 'None') & (temp.floofer == 'None') & (temp.pupper == 'None') & (temp.puppo =='None')]
none_list

0       892420643555336193
1       892177421306343426
2       891815181378084864
3       891689557279858688
4       891327558926688256
5       891087950875897856
6       890971913173991426
7       890729181411237888
8       890609185150312448
10      890006608113172480
11      889880896479866881
13      889638837579907072
15      889278841981685760
16      888917238123831296
17      888804989199671297
18      888554962724278272
19      888202515573088257
20      888078434458587136
21      887705289381826560
22      887517139158093824
23      887473957103951883
24      887343217045368832
25      887101392804085760
26      886983233522544640
27      886736880519319552
28      886680336477933568
30      886267009285017600
31      886258384151887873
32      886054160059072513
33      885984800019947520
               ...        
2326    666411507551481857
2327    666407126856765440
2328    666396247373291520
2329    666373753744588802
2330    666362758909284353
2331    666353288456101888
2

In [10]:
# Assess the data programmatically
temp_dog.value.value_counts()
df_twitter.name.value_counts()
df_twitter.rating_numerator.value_counts()
df_twitter.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64
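
Denominators other than 10 are worth checking against the tweet text. One possible approach, sketched here under the assumption that the first `number/number` pattern in the text is the rating, re-extracts the numerator and denominator with a regular expression that also accepts decimals. The sample texts are shortened from tweets shown in this notebook, and `texts` and `rating` are illustrative names.

```python
import pandas as pd

texts = pd.Series([
    "This is Obi. He got camera shy. 12/10",
    "This is Sophie. She's a Jubilant Bush Pupper. 11.27/10 would smile back",
])

# Capture an optionally decimal numerator and an integer denominator.
rating = texts.str.extract(r'(\d+(?:\.\d+)?)/(\d+)')
rating.columns = ['numerator', 'denominator']
rating = rating.astype(float)
print(rating)
```

Because only the first match is used, tweets that contain a date or another fraction before the rating would still need a manual look.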

In [11]:
df_twitter.text.sample(20)

1633    This is Obi. He got camera shy. 12/10 https://...
636     This is Fiona. She's an extremely mediocre cop...
1259    We 👏🏻 only 👏🏻 rate 👏🏻 dogs. Pls stop sending i...
669     RT @dog_rates: This is Leela. She's a Fetty Wo...
1702    Say hello to Sadie. She's a Tortellini Sidewin...
852     This is my dog. Her name is Zoey. She knows I'...
1051    For anyone who's wondering, this is what happe...
179                          @Marc_IRL pixelated af 12/10
851     This is Tove. She's a Balsamic Poinsetter. Sur...
220     Say hello to Boomer. He's a sandy pupper. Havi...
1174    This is Charleson. He lost his plunger. Looked...
795     This is Bungalo. She uses that face to get wha...
64                        @RealKentMurphy 14/10 confirmed
1230    Here we see an extremely rare Bearded Floofmal...
1309    Say hello to Cupcake. She's an Icelandic Dippe...
1926    This is Stanley. Yes he is aware of the spoon'...
2263    This is Kial. Kial is either wearing a cape, w...
1172    This i

In [12]:
'''
# Example of extracting the dog stage from the tweet text
dog_lists = ['pupper', 'puppo', 'doggo', 'floofer']
for i in range(0,len(twitter_data_copy)):
    text = twitter_data_copy.text[i]
    for dog_status in dog_lists:
       if dog_status in text:
            twitter_data_copy.type[i] = dog_status
'''


- **tweet_id is int64, and in_reply_to_status_id and in_reply_to_user_id are float64. These columns are never used in calculations, so all ID columns should be converted to strings** (Q)
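- **timestamp is stored as object and should be converted to datetime** (Q)
- **Some entries in the name column are missing and need to be filled in, and some dog names were extracted incorrectly** (Q)
- **Dog breed is missing** (Q)
- **Some dogs are tagged with more than one dog stage** (Q)
- **expanded_urls has only 2297 non-null values against the table's 2356 rows; the missing values need to be filled in** (Q)
- **Dog stage is spread over four columns and should be combined into one** (T)
- **retweet_count and favorite_count are missing and need to be added from df_json** (Q)
- **The source column carries no useful information and should be dropped** (T)
- **Some rating values are incorrect** (Q)
- **So that later statistics are not distorted, all missing values should be represented as np.nan rather than as the string 'None'** (Q)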
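
Three of the issues listed above (ID columns as strings, `timestamp` as datetime, and `'None'` strings as `np.nan`) are plain type conversions. A minimal sketch on a toy frame follows; the rows are illustrative and the frame name `toy` is not part of the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'tweet_id': [892420643555336193, 891087950875897856],
    'timestamp': ['2017-08-01 16:23:56 +0000', '2017-07-29 00:08:17 +0000'],
    'name': ['Phineas', 'None'],
})

# IDs are labels, not quantities, so store them as strings.
toy['tweet_id'] = toy['tweet_id'].astype(str)

# Parse the timestamp strings into timezone-aware datetimes.
toy['timestamp'] = pd.to_datetime(toy['timestamp'])

# Turn the 'None' placeholder string into a real missing value.
toy['name'] = toy['name'].replace('None', np.nan)

print(toy.dtypes)
```

Float ID columns that contain missing values, such as `in_reply_to_status_id`, would need extra care and are not covered by this sketch.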



## Checking the df_pred dataset

In [13]:
df_pred

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


In [14]:
df_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [15]:
# Select the rows where none of the three predictions is a dog
df_pred['tweet_id'][(df_pred.p1_dog==False) & (df_pred.p2_dog==False) & (df_pred.p3_dog==False)]

6       666051853826850816
17      666104133288665088
18      666268910803644416
21      666293911632134144
25      666362758909284353
29      666411507551481857
45      666786068205871104
50      666837028449972224
51      666983947667116034
53      667012601033924608
56      667065535570550784
69      667188689915760640
73      667369227918143488
77      667437278097252352
78      667443425659232256
93      667549055577362432
94      667550882905632768
96      667724302356258817
98      667766675769573376
100     667782464991965184
106     667866724293877760
107     667873844930215936
112     667911425562669056
115     667937095915278337
117     668142349051129856
118     668154635664932864
123     668226093875376128
130     668291999406125056
132     668466899341221888
140     668544745690562560
               ...        
1839    837482249356513284
1844    838916489579200512
1847    839290600511926273
1851    840370681858686976
1853    840696689258311684
1869    844580511645339650
1

- **tweet_id and img_num are int64. These columns are never used in calculations, so both should be converted to strings** (Q)

## Checking the df_json dataset

In [16]:
df_json

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,8.924206e+17,8842.0,39492.0
1,8.921774e+17,6480.0,33786.0
2,8.918152e+17,4301.0,25445.0
3,8.916896e+17,8925.0,42863.0
4,8.913276e+17,9721.0,41016.0
5,8.910880e+17,3240.0,20548.0
6,8.909719e+17,2142.0,12053.0
7,8.907292e+17,19548.0,66596.0
8,8.906092e+17,4403.0,28187.0
9,8.902403e+17,7684.0,32467.0


In [17]:
df_json.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2352 entries, 0 to 2351
Data columns (total 3 columns):
tweet_id          2352 non-null float64
retweet_count     2352 non-null float64
favorite_count    2352 non-null float64
dtypes: float64(3)
memory usage: 55.2 KB


- **Retweet and favorite counts should be int64** (Q)
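
`df_json.info()` also reports `tweet_id` as float64. For the counts this is only a cosmetic problem, but for 18-digit IDs it can change the value, because float64 cannot represent every integer of that size. The short check below uses the first `tweet_id` from the archive preview to illustrate the effect.

```python
tweet_id = 892420643555336193   # first tweet_id in the df_twitter preview

as_float = float(tweet_id)      # what a float64 column would store
round_trip = int(as_float)

# float64 has a 53-bit mantissa, so integers of this magnitude are
# spaced 128 apart and most IDs cannot be stored exactly.
print(round_trip == tweet_id)
print(tweet_id - round_trip)
```

This suggests keeping `tweet_id` as an integer or string from the moment it is read, rather than converting it back after the fact.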

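## Summary of issues:

#### Completeness issues
##### df_twitter
* Done C1_`retweet_count` and `favorite_count` are missing and need to be added from `df_json`
* Done C2_Dog breed is missing and needs to be filled in from `df_pred`

#### Quality issues
##### df_twitter
* Q1_Some entries in the `name` column are missing and need to be filled in, and some dog names were extracted incorrectly
* Q2_`tweet_id` is int64, and `in_reply_to_status_id` and `in_reply_to_user_id` are float64. These columns are never used in calculations, so all ID columns should be converted to strings
* Q3_`timestamp` is stored as object and should be converted to datetime
* Q4_Some dogs are tagged with more than one dog stage; this needs to be corrected
* Q5_Some rating values are incorrect
* Q6_So that later statistics are not distorted, all missing values should be represented as np.nan rather than as the string 'None'

##### df_pred
* Q7_`tweet_id` and `img_num` are int64. These columns are never used in calculations, so both should be converted to strings

##### df_json
* Q8_`retweet_count` and `favorite_count` should be int64

#### Tidiness issues
##### df_twitter
* Done T1_The `source`,`expanded_urls`,`in_reply_to_status_id`,`in_reply_to_user_id`,`retweeted_status_id`,`retweeted_status_user_id`,`retweeted_status_timestamp` columns carry no useful information and should be dropped
* Done T2_Dog stage is spread over four columns and should be combined into one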



# Cleaning data
## Adding missing data & removing unused data

### Define
C1_`retweet_count` and `favorite_count` are missing and need to be added from `df_json`

### Code

In [18]:
df_twitter_clean = pd.merge(df_twitter, df_json,on='tweet_id')

### Test

In [19]:
df_twitter_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1517 entries, 0 to 1516
Data columns (total 19 columns):
tweet_id                      1517 non-null int64
in_reply_to_status_id         49 non-null float64
in_reply_to_user_id           49 non-null float64
timestamp                     1517 non-null object
source                        1517 non-null object
text                          1517 non-null object
retweeted_status_id           97 non-null float64
retweeted_status_user_id      97 non-null float64
retweeted_status_timestamp    97 non-null object
expanded_urls                 1480 non-null object
rating_numerator              1517 non-null int64
rating_denominator            1517 non-null int64
name                          1517 non-null object
doggo                         1517 non-null object
floofer                       1517 non-null object
pupper                        1517 non-null object
puppo                         1517 non-null object
retweet_count                 1517 
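
The `info()` output shows 1517 rows after the merge, against 2356 rows in `df_twitter` and 2352 in `df_json`, so the default inner join dropped a large share of tweets. One possible cause, which is an assumption here and is not verified in the notebook, is that `tweet_id` is float64 in `df_json` and int64 in `df_twitter`. Whatever the cause, a left join with `indicator=True` makes unmatched rows visible. The sketch below uses two toy frames; `left`, `right`, `checked`, and `unmatched` are illustrative names.

```python
import pandas as pd

left = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'text': ['a', 'b', 'c']})
right = pd.DataFrame({'tweet_id': ['1', '3'], 'retweet_count': [10, 30]})

# how='left' keeps every archive row; indicator=True adds a '_merge' column
# saying whether a match was found; validate raises if keys are duplicated.
checked = left.merge(
    right, on='tweet_id', how='left', indicator=True, validate='one_to_one'
)

unmatched = checked.loc[checked['_merge'] == 'left_only', 'tweet_id'].tolist()
print(len(checked), unmatched)
```

Inspecting the unmatched IDs before switching to an inner join helps separate tweets that are genuinely absent from the JSON file from tweets lost to a key mismatch.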

### Define
T1_The `source`,`expanded_urls`,`in_reply_to_status_id`,`in_reply_to_user_id`,`retweeted_status_id`,`retweeted_status_user_id`,`retweeted_status_timestamp` columns carry no useful information and should be dropped

### Code

In [20]:
df_twitter_clean.drop(['source','expanded_urls','in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id','retweeted_status_user_id', 'retweeted_status_timestamp'], axis = 1, inplace = True)

### Test

In [21]:
df_twitter_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1517 entries, 0 to 1516
Data columns (total 12 columns):
tweet_id              1517 non-null int64
timestamp             1517 non-null object
text                  1517 non-null object
rating_numerator      1517 non-null int64
rating_denominator    1517 non-null int64
name                  1517 non-null object
doggo                 1517 non-null object
floofer               1517 non-null object
pupper                1517 non-null object
puppo                 1517 non-null object
retweet_count         1517 non-null float64
favorite_count        1517 non-null float64
dtypes: float64(2), int64(3), object(7)
memory usage: 154.1+ KB


### Define
C2_Dog breed predictions are missing and need to be filled in from `df_pred`

### Code

In [22]:
df_pred_clean = df_pred.copy()
df_pred_clean.drop(['jpg_url','img_num'],axis = 1, inplace = True)
df_twitter_clean = pd.merge(df_twitter_clean, df_pred_clean,on='tweet_id')

### Test

In [23]:
df_twitter_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1349 entries, 0 to 1348
Data columns (total 21 columns):
tweet_id              1349 non-null int64
timestamp             1349 non-null object
text                  1349 non-null object
rating_numerator      1349 non-null int64
rating_denominator    1349 non-null int64
name                  1349 non-null object
doggo                 1349 non-null object
floofer               1349 non-null object
pupper                1349 non-null object
puppo                 1349 non-null object
retweet_count         1349 non-null float64
favorite_count        1349 non-null float64
p1                    1349 non-null object
p1_conf               1349 non-null float64
p1_dog                1349 non-null bool
p2                    1349 non-null object
p2_conf               1349 non-null float64
p2_dog                1349 non-null bool
p3                    1349 non-null object
p3_conf               1349 non-null float64
p3_dog                1349 non-null
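
The merged frame now carries three predictions per tweet, while the planned master dataset lists a single breed and a single prediction probability. One way to reduce the three to one, sketched here as an assumption and not as the notebook's own method, is to take the highest-ranked prediction that is flagged as a dog. The toy rows are copied from the `df_pred` preview; `breed` and `breed_conf` are hypothetical column names.

```python
import pandas as pd

pred = pd.DataFrame({
    'p1': ['chow', 'shopping_cart', 'box_turtle'],
    'p1_conf': [0.692517, 0.962465, 0.933012],
    'p1_dog': [True, False, False],
    'p2': ['Tibetan_mastiff', 'shopping_basket', 'mud_turtle'],
    'p2_conf': [0.058279, 0.014594, 0.045885],
    'p2_dog': [True, False, False],
    'p3': ['fur_coat', 'golden_retriever', 'terrapin'],
    'p3_conf': [0.054449, 0.007959, 0.017885],
    'p3_dog': [False, True, False],
})

# Series.where keeps a value where the condition is True and falls back to
# the second argument otherwise, so the chain tries p1, then p2, then p3.
pred['breed'] = pred['p1'].where(
    pred['p1_dog'],
    pred['p2'].where(pred['p2_dog'], pred['p3'].where(pred['p3_dog'])),
)
pred['breed_conf'] = pred['p1_conf'].where(
    pred['p1_dog'],
    pred['p2_conf'].where(pred['p2_dog'], pred['p3_conf'].where(pred['p3_dog'])),
)
print(pred[['breed', 'breed_conf']])
```

Rows where none of the three predictions is a dog end up with a missing breed, which matches the non-dog filter used during assessment.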

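### Define
T2_Dog stage is spread over four columns and should be combined into one

Q4_Some dogs are tagged with more than one dog stage; this needs to be corrected

### Code

In [24]:
# Scan the tweet text to fill in the dog stage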
df_twitter_clean['type'] = None
dog_lists = ['pupper', 'puppo', 'doggo', 'floofer']
for i in range(0,len(df_twitter_clean)):
    text = df_twitter_clean.text[i]
    for dog_status in dog_lists:
        if dog_status in text:
            df_twitter_clean.type[i] = dog_status
df_twitter_clean

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,tweet_id,timestamp,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,...,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,type
0,891815181378084864,2017-07-31 00:18:03 +0000,This is Archie. He is a rare Norwegian Pouncin...,12,10,Archie,,,,,...,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True,
1,891689557279858688,2017-07-30 15:58:51 +0000,This is Darla. She commenced a snooze mid meal...,13,10,Darla,,,,,...,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False,
2,891327558926688256,2017-07-29 16:00:24 +0000,This is Franklin. He would like you to stop ca...,12,10,Franklin,,,,,...,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True,
3,891087950875897856,2017-07-29 00:08:17 +0000,Here we have a majestic great white breaching ...,13,10,,,,,,...,Chesapeake_Bay_retriever,0.425595,True,Irish_terrier,0.116317,True,Indian_elephant,0.076902,False,
4,890729181411237888,2017-07-28 00:22:40 +0000,When you watch your owner call another dog a g...,13,10,,,,,,...,Pomeranian,0.566142,True,Eskimo_dog,0.178406,True,Pembroke,0.076507,True,
5,890609185150312448,2017-07-27 16:25:51 +0000,This is Zoey. She doesn't want to be one of th...,13,10,Zoey,,,,,...,Irish_terrier,0.487574,True,Irish_setter,0.193054,True,Chesapeake_Bay_retriever,0.118184,True,
6,890006608113172480,2017-07-26 00:31:25 +0000,This is Koda. He is a South Australian decksha...,13,10,Koda,,,,,...,Samoyed,0.957979,True,Pomeranian,0.013884,True,chow,0.008167,True,
7,889638837579907072,2017-07-25 00:10:02 +0000,This is Ted. He does his best. Sometimes that'...,12,10,Ted,,,,,...,French_bulldog,0.991650,True,boxer,0.002129,True,Staffordshire_bullterrier,0.001498,True,
8,889278841981685760,2017-07-24 00:19:32 +0000,This is Oliver. You're witnessing one of his m...,13,10,Oliver,,,,,...,whippet,0.626152,True,borzoi,0.194742,True,Saluki,0.027351,True,
9,888917238123831296,2017-07-23 00:22:39 +0000,This is Jim. He found a fren. Taught him how t...,12,10,Jim,,,,,...,golden_retriever,0.714719,True,Tibetan_mastiff,0.120184,True,Labrador_retriever,0.105506,True,
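
The warning printed by this cell comes from the chained assignment `df_twitter_clean.type[i] = ...`, which first selects a column and then writes into the result. A hedged sketch of the same loop with a single `.loc` call follows; it runs on a toy frame, and `toy` and `stages` are names introduced only for the illustration.

```python
import pandas as pd

toy = pd.DataFrame({'text': ['a sandy pupper', 'no stage here', 'a shy doggo']})
toy['type'] = None
stages = ['pupper', 'puppo', 'doggo', 'floofer']

for i in toy.index:
    for stage in stages:
        if stage in toy.loc[i, 'text']:
            # One .loc call selects row and column together,
            # so the write reaches toy itself and no warning is raised.
            toy.loc[i, 'type'] = stage

print(toy)
```

As in the original loop, the last matching stage in the list wins when a text mentions more than one.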


In [25]:
#df_twitter_clean.type.value_counts()

In [26]:
#228+80+27+3+1752

In [28]:
none_list = df_twitter_clean['tweet_id'][(df_twitter_clean.doggo == 'None') & (df_twitter_clean.floofer == 'None') & (df_twitter_clean.pupper == 'None') & (df_twitter_clean.puppo =='None')]
none_dog_list = list(none_list)
len(none_dog_list)

1136

In [32]:
# Show the rows that carry more than one stage tag
with pd.option_context('max_colwidth', 200):
    display(df_twitter_clean[df_twitter_clean['tweet_id'].isin(dupl_dog_list)][['tweet_id','name','text','doggo','floofer','pupper','puppo']])

Unnamed: 0,tweet_id,name,text,doggo,floofer,pupper,puppo
99,855851453814013952,,Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for https://t.co/cMhq16isel,doggo,,,puppo
104,854010172552949760,,"At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs https://t.co/TXdT3tmuYk",doggo,floofer,,
273,801115127852503040,Bones,This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine https://t.co/55Dqe0SJNj,doggo,,pupper,
334,785639753186217984,Pinot,This is Pinot. He's a sophisticated doggo. You can tell by the hat. Also pointier than your average pupper. Still 10/10 would pet cautiously https://t.co/f2wmLZTPHd,doggo,,pupper,
439,759793422261743616,Maggie,"Meet Maggie &amp; Lila. Maggie is the doggo, Lila is the pupper. They are sisters. Both 12/10 would pet at the same time https://t.co/MYwR4DQKll",doggo,,pupper,
482,751583847268179968,,Please stop sending it pictures that don't even have a doggo or pupper in them. Churlish af. 5/10 neat couch tho https://t.co/u2c9c7qSg8,doggo,,pupper,
542,741067306818797568,just,This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC,doggo,,pupper,
575,733109485275860992,,"Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda",doggo,,pupper,


In [33]:
#with pd.option_context('max_colwidth', 200):
    #display(df_twitter_clean[df_twitter_clean['tweet_id'].isin(none_dog_list)][['tweet_id','text','doggo','floofer','pupper','puppo']])

In [34]:
df_twitter_clean[df_twitter_clean['tweet_id'].isin(dupl_dog_list)].index

Int64Index([99, 104, 273, 334, 439, 482, 542, 575], dtype='int64')

In [35]:
with pd.option_context('max_colwidth', 200):
    display(df_pred[df_pred['tweet_id'].isin(dupl_dog_list)][['tweet_id','jpg_url','p1_dog','p2_dog','p3_dog']])

Unnamed: 0,tweet_id,jpg_url,p1_dog,p2_dog,p3_dog
1155,733109485275860992,https://pbs.twimg.com/media/CiyHLocU4AI2pJu.jpg,True,True,False
1201,741067306818797568,https://pbs.twimg.com/media/CkjMx99UoAM2B1a.jpg,True,True,True
1292,751583847268179968,https://pbs.twimg.com/media/Cm4phTpWcAAgLsr.jpg,True,False,False
1350,759793422261743616,https://pbs.twimg.com/media/CotUFZEWcAA2Pku.jpg,True,True,True
1407,770093767776997377,https://pbs.twimg.com/media/CkjMx99UoAM2B1a.jpg,True,True,True
1446,775898661951791106,https://pbs.twimg.com/media/CiyHLocU4AI2pJu.jpg,True,True,False
1507,785639753186217984,https://pbs.twimg.com/media/CucnLmeWAAALOSC.jpg,False,False,False
1609,801115127852503040,https://pbs.twimg.com/media/Cx4h7zHUsAAqaJd.jpg,True,True,True
1616,802265048156610565,https://pbs.twimg.com/media/CyI3zXgWEAACQfB.jpg,True,True,True
1644,808106460588765185,https://pbs.twimg.com/media/Czb4iFRXgAIUMiN.jpg,True,True,True


In [36]:
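# Using the tweet text, together with the images and predictions in df_pred, correct the rows tagged with more than one stage.
# The duplicates fall into three groups:
# Group 1: the stage was extracted incorrectly: 153, 159, 356, 463, 780. Set the correct stage.
# Group 2: two dogs, usually parent and pup, with two stages: 428, 456, 626, 665, 722, 871, 917. Keep the pup's pupper tag, because the parent-pup relationship exists because of the pup.
# Group 3: not a dog: 565. Remove the stage information.
# Note: these row labels are hard-coded and differ from the labels printed in cell 34 (99, 104, 273, ...); check them against the current index before running.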

df_twitter_clean.loc[153,'doggo'] = 'None'
df_twitter_clean.loc[159,'doggo'] = 'None'
df_twitter_clean.loc[356,'doggo'] = 'None'
df_twitter_clean.loc[428,'doggo'] = 'None'
df_twitter_clean.loc[456,'doggo'] = 'None'
df_twitter_clean.loc[463,'doggo'] = 'None'
df_twitter_clean.loc[565,'pupper'] = 'None'
df_twitter_clean.loc[565,'doggo'] = 'None'
df_twitter_clean.loc[626,'doggo'] = 'None'
df_twitter_clean.loc[665,'doggo'] = 'None'
df_twitter_clean.loc[722,'doggo'] = 'None'
df_twitter_clean.loc[780,'pupper'] = 'None'
df_twitter_clean.loc[871,'doggo'] = 'None'
df_twitter_clean.loc[917,'doggo'] = 'None'
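
Row labels change whenever the frame is merged or filtered, so corrections keyed on `tweet_id` are less fragile than corrections keyed on a label. The sketch below shows the idea on a toy frame. The two IDs come from the duplicate list above, but the `corrections` mapping is illustrative only and is not the full set of decisions made in this cell.

```python
import pandas as pd

toy = pd.DataFrame({
    'tweet_id': [855851453814013952, 854010172552949760, 892420643555336193],
    'doggo':   ['doggo', 'doggo', 'None'],
    'floofer': ['None', 'floofer', 'None'],
    'puppo':   ['puppo', 'None', 'None'],
})

# tweet_id -> stage columns to reset (illustrative entries only).
corrections = {
    855851453814013952: ['doggo'],
    854010172552949760: ['doggo'],
}

for tweet_id, cols in corrections.items():
    # Select by ID instead of by row label, then reset the listed columns.
    toy.loc[toy['tweet_id'] == tweet_id, cols] = 'None'

print(toy)
```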

In [37]:
df_twitter_clean['dog_status'] = 'None'
for i in range(0,len(df_twitter_clean)):
    if df_twitter_clean.doggo[i] != 'None':
        df_twitter_clean.dog_status[i] = 'doggo'
    elif df_twitter_clean.floofer[i] != 'None':
        df_twitter_clean.dog_status[i] = 'floofer'
    elif df_twitter_clean.pupper[i] != 'None':
        df_twitter_clean.dog_status[i] = 'pupper'
    elif df_twitter_clean.puppo[i] != 'None':
        df_twitter_clean.dog_status[i] = 'puppo'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
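
The row-by-row loop above can also be written without chained assignment. The following sketch, an alternative and not the notebook's method, treats `'None'` as missing and takes the first filled column in the same priority order as the loop (doggo, floofer, pupper, puppo). It leaves rows without any stage as missing values instead of the string `'None'`; `toy`, `stages`, and `status` are illustrative names.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
toy = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None'],
})

# Keep real stage names and turn the 'None' placeholder into NaN.
stages = toy[stage_cols].where(toy[stage_cols] != 'None')

# combine_first fills gaps in the running result from the next column.
status = stages[stage_cols[0]]
for col in stage_cols[1:]:
    status = status.combine_first(stages[col])

toy['dog_status'] = status
print(toy['dog_status'])
```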


In [38]:
df_twitter_clean

Unnamed: 0,tweet_id,timestamp,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,...,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog,type,dog_status
0,891815181378084864,2017-07-31 00:18:03 +0000,This is Archie. He is a rare Norwegian Pouncin...,12,10,Archie,,,,,...,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True,,
1,891689557279858688,2017-07-30 15:58:51 +0000,This is Darla. She commenced a snooze mid meal...,13,10,Darla,,,,,...,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False,,
2,891327558926688256,2017-07-29 16:00:24 +0000,This is Franklin. He would like you to stop ca...,12,10,Franklin,,,,,...,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True,,
3,891087950875897856,2017-07-29 00:08:17 +0000,Here we have a majestic great white breaching ...,13,10,,,,,,...,0.425595,True,Irish_terrier,0.116317,True,Indian_elephant,0.076902,False,,
4,890729181411237888,2017-07-28 00:22:40 +0000,When you watch your owner call another dog a g...,13,10,,,,,,...,0.566142,True,Eskimo_dog,0.178406,True,Pembroke,0.076507,True,,
5,890609185150312448,2017-07-27 16:25:51 +0000,This is Zoey. She doesn't want to be one of th...,13,10,Zoey,,,,,...,0.487574,True,Irish_setter,0.193054,True,Chesapeake_Bay_retriever,0.118184,True,,
6,890006608113172480,2017-07-26 00:31:25 +0000,This is Koda. He is a South Australian decksha...,13,10,Koda,,,,,...,0.957979,True,Pomeranian,0.013884,True,chow,0.008167,True,,
7,889638837579907072,2017-07-25 00:10:02 +0000,This is Ted. He does his best. Sometimes that'...,12,10,Ted,,,,,...,0.991650,True,boxer,0.002129,True,Staffordshire_bullterrier,0.001498,True,,
8,889278841981685760,2017-07-24 00:19:32 +0000,This is Oliver. You're witnessing one of his m...,13,10,Oliver,,,,,...,0.626152,True,borzoi,0.194742,True,Saluki,0.027351,True,,
9,888917238123831296,2017-07-23 00:22:39 +0000,This is Jim. He found a fren. Taught him how t...,12,10,Jim,,,,,...,0.714719,True,Tibetan_mastiff,0.120184,True,Labrador_retriever,0.105506,True,,


In [39]:
df_twitter_clean.type.value_counts()

pupper    156
doggo      42
puppo      19
Name: type, dtype: int64

In [41]:
df_twitter_clean.dog_status.value_counts()

None       1137
pupper      148
doggo        44
puppo        16
floofer       4
Name: dog_status, dtype: int64

In [48]:
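# After tidying, some dog_status values still disagree with the `type` column I extracted from the text myself.
# On inspection, my extraction did not account for capitalised words (e.g. "Doggo"), so dog_status holds the correct data.
with pd.option_context('display.max_colwidth', 200):
    mismatch = (df_twitter_clean['dog_status'] != 'None') & (df_twitter_clean['type'] != df_twitter_clean['dog_status'])
    display(df_twitter_clean.loc[mismatch, ['tweet_id', 'type', 'dog_status', 'text']])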

Unnamed: 0,tweet_id,type,dog_status,text
201,819015337530290176,,doggo,RT @dog_rates: This is Bo. He was a very good First Doggo. 14/10 would be an absolute honor to pet https://t.co/AdPKrI8BZ1
202,819004803107983360,,doggo,This is Bo. He was a very good First Doggo. 14/10 would be an absolute honor to pet https://t.co/AdPKrI8BZ1
263,803638050916102144,,pupper,Pupper hath acquire enemy. 13/10 https://t.co/ns9qoElfsX
367,778027034220126208,,pupper,This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq
475,752660715232722944,,doggo,Hooman used Pokeball\n*wiggle*\n*wiggle*\nDoggo broke free \n10/10 https://t.co/bWSgqnwSHr
503,749317047558017024,,floofer,This is Blu. He's a wild bush Floofer. I wish anything made me as happy as bushes make Blu. 12/10 would frolic with https://t.co/HHUAnBb6QB
505,748932637671223296,,doggo,Say hello to Divine Doggo. Must be magical af. 13/10 would be an honor to pet https://t.co/BbcABzohKb
533,743222593470234624,,pupper,This is a very rare Great Alaskan Bush Pupper. Hard to stumble upon without spooking. 12/10 would pet passionately https://t.co/xOBKCdpzaa
561,737445876994609152,,floofer,Just wanted to share this super rare Rainbow Floofer in case you guys haven't seen it yet. 13/10 colorful af https://t.co/CaG9MzD3WT
573,733822306246479872,,floofer,This is Moose. He's a Polynesian Floofer. Dapper af. 10/10 would pet diligently https://t.co/mVfqRdppTL
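
The mismatches above all involve capitalised stage words. A minimal sketch of a case-insensitive extraction follows; it is not the notebook's original extraction code (which is not shown here). The `sample` frame and `stage_pattern` are illustrative names, and the last row is a made-up tweet with no stage word.

```python
import re
import pandas as pd

# Toy rows based on tweets listed above (links removed); the last row is hypothetical.
sample = pd.DataFrame({
    "text": [
        "This is Bo. He was a very good First Doggo. 14/10 would be an absolute honor to pet",
        "Pupper hath acquire enemy. 13/10",
        "This is Moose. He's a Polynesian Floofer. Dapper af. 10/10 would pet diligently",
        "A hypothetical tweet with no stage word. 12/10",
    ]
})

# Match any stage keyword regardless of case, then normalise to lower case.
stage_pattern = r"\b(doggo|floofer|pupper|puppo)\b"
sample["type"] = (
    sample["text"]
    .str.extract(stage_pattern, flags=re.IGNORECASE, expand=False)
    .str.lower()
)

print(sample[["type", "text"]])
```

Rows without a stage word stay missing (`NaN`), so they can still be told apart from matched rows. Only the first stage word in each tweet is captured by this sketch.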


In [53]:
df_twitter_clean = df_twitter_clean.drop(['type','doggo','floofer','pupper','puppo'],axis=1)

### Test

In [54]:
df_twitter_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1349 entries, 0 to 1348
Data columns (total 18 columns):
tweet_id              1349 non-null int64
timestamp             1349 non-null object
text                  1349 non-null object
rating_numerator      1349 non-null int64
rating_denominator    1349 non-null int64
name                  1349 non-null object
retweet_count         1349 non-null float64
favorite_count        1349 non-null float64
p1                    1349 non-null object
p1_conf               1349 non-null float64
p1_dog                1349 non-null bool
p2                    1349 non-null object
p2_conf               1349 non-null float64
p2_dog                1349 non-null bool
p3                    1349 non-null object
p3_conf               1349 non-null float64
p3_dog                1349 non-null bool
dog_status            1349 non-null object
dtypes: bool(3), float64(5), int64(3), object(7)
memory usage: 172.6+ KB
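
Reading `info()` by eye confirms the drop, but the same check can be written as assertions. This is a sketch on a hypothetical miniature frame (`df_demo`) standing in for `df_twitter_clean`; the column values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_twitter_clean before the drop.
df_demo = pd.DataFrame({
    "tweet_id": [1, 2],
    "type": [None, "pupper"],
    "doggo": ["None", "None"],
    "floofer": ["None", "None"],
    "pupper": ["None", "pupper"],
    "puppo": ["None", "None"],
    "dog_status": ["None", "pupper"],
})

stage_columns = ["type", "doggo", "floofer", "pupper", "puppo"]
df_demo = df_demo.drop(columns=stage_columns)

# Programmatic test: none of the dropped columns remain, and dog_status survives.
leftover = set(stage_columns) & set(df_demo.columns)
assert not leftover, f"columns not dropped: {leftover}"
assert "dog_status" in df_demo.columns
```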



In [76]:
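# Work on copies so the original DataFrames stay untouched.
df_twitter_test = df_twitter.copy()
df_json_test = df_json.copy()
# Join the archive with the retweet/favorite counts on tweet_id (inner join by default).
df_twitter_clean = pd.merge(df_twitter_test, df_json_test, on='tweet_id')
df_twitter_clean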

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,retweet_count,favorite_count
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,8842,39492
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,6480,33786
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,,4301,25445
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,,8925,42863
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,,9721,41016
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,,3240,20548
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,,2142,12053
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,,19548,66596
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,,4403,28187
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,,7684,32467
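
`pd.merge` defaults to an inner join, so tweets missing from either frame are dropped silently. The sketch below, on two hypothetical toy frames with invented ids and counts, shows how the `how` and `validate` parameters make that choice explicit and guard against duplicate `tweet_id` values.

```python
import pandas as pd

# Hypothetical toy frames; ids and counts are made up for illustration.
df_twitter_demo = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 14]})
df_json_demo = pd.DataFrame({
    "tweet_id": [1, 2],
    "retweet_count": [100, 200],
    "favorite_count": [1000, 2000],
})

# Inner join: keeps only tweets present in both frames.
# validate raises MergeError if tweet_id is duplicated on either side.
inner = pd.merge(df_twitter_demo, df_json_demo, on="tweet_id",
                 how="inner", validate="one_to_one")

# Left join: keeps every archive row; unmatched counts become NaN,
# which turns the count columns into floats.
left = pd.merge(df_twitter_demo, df_json_demo, on="tweet_id",
                how="left", validate="one_to_one")

print(len(inner), inner["retweet_count"].dtype)
print(len(left), left["retweet_count"].dtype)
```

Counts that appear as floats after a join can be one sign that missing values were present at some stage, in which case casting back to an integer type after handling them may be worth considering.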


In [87]:
'''
# Reference link for generating a word cloud: https://blog.csdn.net/fontthrone/article/details/72775865
df_wordcloud = df_twitter_clean.text
df_wordcloud
'''

'\n# Reference link for generating a word cloud: https://blog.csdn.net/fontthrone/article/details/72775865\ndf_wordcloud = df_twitter_clean.text\ndf_wordcloud\n'
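
Before drawing a word cloud, the tweet text usually needs links removed and words counted. The following is a rough sketch using only the standard library; `tokenize`, `STOPWORDS`, and the two-tweet `texts` list are illustrative assumptions, not part of the notebook.

```python
import re
from collections import Counter

# Two tweets taken from the output above, used here as sample input.
texts = [
    "Pupper hath acquire enemy. 13/10 https://t.co/ns9qoElfsX",
    "Say hello to Divine Doggo. Must be magical af. 13/10 would be an honor to pet https://t.co/BbcABzohKb",
]

# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = {"to", "be", "an", "a", "the", "would", "must"}

def tokenize(text):
    """Lower-case the text, drop links, and return alphabetic tokens."""
    text = re.sub(r"https?://\S+", " ", text)
    return re.findall(r"[a-z']+", text.lower())

word_counts = Counter(
    word
    for text in texts
    for word in tokenize(text)
    if word not in STOPWORDS
)

print(word_counts.most_common(5))
```

If the third-party `wordcloud` package is used for the plot, a frequency mapping like `word_counts` can be passed to `WordCloud.generate_from_frequencies`.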