# Clean Data

This notebook cleans the captions and metadata collected for the Jose Mourinho videos. It reshapes the nested metadata into a flat table, removes videos that do not feature Jose Mourinho, and removes as many duplicate videos as possible.

In [132]:
import pandas as pd
import numpy as np
import pickle

In [133]:
with open('Data/meta_data.pkl', 'rb') as f:
    meta_data = pickle.load(f)

In [134]:
with open('Data/captions.pkl', 'rb') as f:
    captions = pickle.load(f)

In [135]:
captions = pd.DataFrame(captions)
captions.columns=['Caption','Video Id']

In [136]:
len(captions)

1653

In [137]:
meta = pd.DataFrame(meta_data)

In [138]:
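# Each response has an 'items' list; expand it and keep its first element as one row per video
meta_df = meta['items'].apply(pd.Series)[0].apply(pd.Series)
# Expand the nested 'snippet' and 'statistics' dicts into their own frames
snip = meta_df['snippet'].apply(pd.Series)
stats = meta_df['statistics'].apply(pd.Series)
# 'localized' and 'thumbnails' are nested inside 'snippet' (thumbs is not used below)
local = snip['localized'].apply(pd.Series)
thumbs = snip['thumbnails'].apply(pd.Series)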

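
The repeated `.apply(pd.Series)` calls above flatten one level of nesting at a time. As a hedged alternative, `pd.json_normalize` (available in pandas 1.0 and later) can flatten nested dictionaries in one pass. The sketch below uses a made-up `toy_meta` list and assumes the real `meta_data` has the same shape, with one item per response; that assumption is not confirmed by this notebook.

```python
import pandas as pd

# Hypothetical stand-in for `meta_data`: one API response per video,
# each with a single entry in its 'items' list.
toy_meta = [
    {"items": [{"id": "abc",
                "snippet": {"title": "Mourinho presser", "channelTitle": "Chan A"},
                "statistics": {"viewCount": "10", "likeCount": "2"}}]},
    {"items": [{"id": "def",
                "snippet": {"title": "Post-match", "channelTitle": "Chan B"},
                "statistics": {"viewCount": "5"}}]},
]

# Take the first item of each response, then flatten nested dicts into
# dotted column names such as 'snippet.title' and 'statistics.viewCount'.
first_items = [response["items"][0] for response in toy_meta]
flat = pd.json_normalize(first_items)

print(sorted(flat.columns))
```

Keys missing from a record (here `likeCount` for the second video) come out as `NaN`, which matches how the `apply(pd.Series)` approach treats missing statistics.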


In [139]:
# Join the expanded frames back on the row index; overlapping column names get _x/_y suffixes
final_df = meta_df.merge(snip,how='outer', left_index=True, right_index=True)

In [140]:
final_df = final_df.merge(stats,how='outer', left_index=True, right_index=True)
final_df = final_df.merge(local,how='outer', left_index=True, right_index=True)

In [141]:
final_df.columns

Index(['kind', 'etag', 'id', 'snippet', 'statistics', '0_x', 'categoryId',
       'channelId', 'channelTitle', 'defaultAudioLanguage', 'defaultLanguage',
       'description_x', 'liveBroadcastContent', 'localized', 'publishedAt',
       'tags', 'thumbnails', 'title_x', '0_y', 'commentCount', 'dislikeCount',
       'favoriteCount', 'likeCount', 'viewCount', '0_x', 'title_y',
       'description_y', '0_y'],
      dtype='object')

In [142]:
final_df.drop(['kind','etag','snippet','statistics','0_x','localized','categoryId','channelId',
              'defaultAudioLanguage','liveBroadcastContent','thumbnails',
              '0_y','title_y','description_y'],axis='columns',inplace=True)

In [143]:
final_df.columns = ['Video Id', 'ChannelTitle','Language','Description','Published Datetime',
            'Tags','Video Title','Comment #','Dislike #','Favorite #','Like #','View #']

final_df.head()

Unnamed: 0,Video Id,ChannelTitle,Language,Description,Published Datetime,Tags,Video Title,Comment #,Dislike #,Favorite #,Like #,View #
0,cYF_QjP_cMU,Manchester United,en-GB,Subscribe to Manchester United on YouTube at h...,2018-10-22T15:00:04.000Z,"[manchester united, mufc, man utd, manutd, mu,...",Mourinho & Lukaku Press Conference | Mancheste...,409,70,0,2942,141380
1,QZDwtT8WNew,BeanymanSports,en-GB,Get the Onefootball app here! = http://bit.do/...,2017-07-24T07:34:57.000Z,"[Football, Soccer, Beanyman, BeanymanSports, z...",Real Madrid 1-1 Man Utd (1-2 Pens) - Zinedine ...,50,23,0,171,34979
2,7uIQRSRArY4,BeanymanSports,en-GB,Press conference with Manchester United manage...,2018-10-02T21:57:28.000Z,"[Football, Soccer, Beanyman, BeanymanSports, J...",Manchester United 0-0 Valencia - Jose Mourinho...,496,93,0,539,87797
3,J-r39kp6jiw,BeanymanSports,en-GB,Post-match press conference with Man United ma...,2017-01-21T19:22:25.000Z,"[Football, Soccer, Beanyman, BeanymanSports, P...",Stoke 1-1 Manchester United - Jose Mourinho Fu...,69,29,0,197,31648
4,wZPjwbIOnno,Goal,,Interview with Manchester United manager Jose ...,2016-09-15T14:30:03.000Z,"[amazing, incredible, footage, video, transfer...","Goal Exclusive interview - Mourinho, the maste...",10,8,0,93,6906


In [144]:
df = final_df.merge(captions,on='Video Id',how='inner')
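
The caption table had 1653 rows, while the inner join on `Video Id` yields 1651. One way to see which rows fail to match is an outer merge with `indicator=True`. The frames `meta_toy` and `captions_toy` below are made-up examples, not the notebook's data.

```python
import pandas as pd

# Hypothetical toy frames standing in for `final_df` and `captions`.
meta_toy = pd.DataFrame({"Video Id": ["a", "b", "c"],
                         "Video Title": ["t1", "t2", "t3"]})
captions_toy = pd.DataFrame({"Caption": ["x", "y", "z"],
                             "Video Id": ["a", "b", "d"]})

# An outer merge keeps every row from both sides; the '_merge' column
# records whether a row came from the left, the right, or both.
audit = meta_toy.merge(captions_toy, on="Video Id", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(unmatched[["Video Id", "_merge"]])
```

Rows marked `left_only` have metadata but no caption, and rows marked `right_only` have a caption but no metadata. Duplicate ids on either side would also change the row count, so the audit is a starting point rather than a full explanation.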

In [145]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1651 entries, 0 to 1650
Data columns (total 13 columns):
Video Id              1651 non-null object
ChannelTitle          1651 non-null object
Language              359 non-null object
Description           1651 non-null object
Published Datetime    1651 non-null object
Tags                  1460 non-null object
Video Title           1651 non-null object
Comment #             1635 non-null object
Dislike #             1639 non-null object
Favorite #            1651 non-null object
Like #                1639 non-null object
View #                1651 non-null object
Caption               1651 non-null object
dtypes: object(13)
memory usage: 180.6+ KB
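
Every column above is stored as `object`, so the counts are strings rather than numbers. The preview that follows shows date-only values, float counts, and a `years` column, but the cell that produced them is not part of this notebook. The sketch below is one possible way to get a similar result on a made-up `toy` frame; it is an assumption, not the original conversion code.

```python
import pandas as pd

# Hypothetical toy frame using the same column names as `df`.
toy = pd.DataFrame({
    "Published Datetime": ["2018-10-22T15:00:04.000Z", "2016-09-15T14:30:03.000Z"],
    "Comment #": ["409", None],
    "View #": ["141380", "6906"],
})

# Convert count columns from strings to numbers; unparseable or missing
# values become NaN, which forces that column to float.
count_cols = ["Comment #", "View #"]
toy[count_cols] = toy[count_cols].apply(pd.to_numeric, errors="coerce")

# Parse the ISO 8601 timestamps, then keep the calendar date and the year.
published = pd.to_datetime(toy["Published Datetime"], utc=True)
toy["Published Datetime"] = published.dt.date
toy["years"] = published.dt.year

print(toy.dtypes)
```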


In [136]:
pd.set_option('display.max_rows', 1400)
# Note: `test` is not defined in any cell shown in this notebook
test

Unnamed: 0,Video Id,ChannelTitle,Language,Description,Published Datetime,Tags,Video Title,Comment #,Dislike #,Favorite #,Like #,View #,Caption,years
0,cYF_QjP_cMU,Manchester United,en-GB,Subscribe to Manchester United on YouTube at h...,2018-10-22,"[manchester united, mufc, man utd, manutd, mu,...",Mourinho & Lukaku Press Conference | Mancheste...,409.0,70.0,0,2942.0,141380,okay welcome to the press conference you do ha...,2018
2,7uIQRSRArY4,BeanymanSports,en-GB,Press conference with Manchester United manage...,2018-10-02,"[Football, Soccer, Beanyman, BeanymanSports, J...",Manchester United 0-0 Valencia - Jose Mourinho...,496.0,93.0,0,539.0,87797,Yes person to do a partir appliances red meat ...,2018
3,J-r39kp6jiw,BeanymanSports,en-GB,Post-match press conference with Man United ma...,2017-01-21,"[Football, Soccer, Beanyman, BeanymanSports, P...",Stoke 1-1 Manchester United - Jose Mourinho Fu...,69.0,29.0,0,197.0,31648,okay the money just here to take questions on ...,2017
4,wZPjwbIOnno,Goal,,Interview with Manchester United manager Jose ...,2016-09-15,"[amazing, incredible, footage, video, transfer...","Goal Exclusive interview - Mourinho, the maste...",10.0,8.0,0,93.0,6906,I love that feeling of influence the players c...,2016
5,4pSt84kixJE,BeanymanSports,en-GB,Find the Best Ticket Prices https://ticket-com...,2018-02-11,"[Football, Soccer, Beanyman, BeanymanSports, N...",Newcastle 1-0 Manchester United - Jose Mourinh...,297.0,58.0,0,208.0,48747,the big game is fast approaching but wait you ...,2018
6,j-6MiviBmt8,TheRedDEVILS,,Manchester United vs. Leicester City | 4-1 |...,2016-09-24,"[Manchester United vs. Leicester City, Manches...",Manchester United vs. Leicester City | 4-1 |...,0.0,0.0,0,3.0,24,so the champions swept aside today an emotion ...,2016
7,TGr06-Cv398,Football Time,,"José Mário dos Santos Mourinho Félix, GOIH (bo...",2016-09-08,"[Zlatan Ibrahimovic (Football player), Mesut O...",Jose Mourinho | The Special One | Documentary,50.0,36.0,0,844.0,170934,no one could ever have been quite prepared for...,2016
8,pgKmNc27OnA,Foot OMG,,Chelsea vs Manchester United 0-1\nJose Mourinh...,2018-05-19,"[Foot OMG, football]",Chelsea 1-0 Man Utd ▪️ Jose Mourinho Post Matc...,0.0,1.0,0,0.0,36,toyatte margins few chances you think manually...,2018
9,68m1UwmUvOI,MrBeanyman,,Saint-Etienne 0-1 Manchester United (Agg 0-4) ...,2017-02-22,"[Football, Soccer, Beanyman, BeanymanSports]",Saint-Etienne 0-1 Manchester United (Agg 0-4) ...,46.0,8.0,0,214.0,30365,[Music] Jesse he wanted a focus he wanted a so...,2017
10,9Q1BjvXfuvA,WeareManchesterUnited,,Jose Mourinho Post Match Conference vs Newcast...,2018-10-06,[Jose Mourinho Post Match Conference vs Newcas...,Jose Mourinho Post Match Conference vs Newcast...,74.0,20.0,0,264.0,40839,I'm happy for the fans and for the players you...,2018


In [146]:
rows_to_drop=[231,210,605,1564,712,1100,1228,461,958,963,1377,1357,327,895,781,1570,1054,274,943,1124,1376,1165,1278,1282,816,482,1520,1636,1318,1150,636,700,723,1084,1438,800,1035,1629,415,769,1162,567,632,689,1471,273,1193,571,647,841,1099,1190,1222,1528,1535,514,893,1393,1461,1631,261,1065,155,1482,1000,1091,1097,1334,1456,624,721,731,1057,1175,1271,1390,1474,1567,322,325,718,980,684,719,973,998,1494,1624,1268,1428,1512,1112,1146,1178,1219,1297,1299,1340,1400,1405,1486,239,444,337,406,502,591,619,664,859,883,941,76,88,221,285,541,939,783,1096,1243,1308,1364,1590,1601,1621,80,152,349,390,410,703,754,867,940,1019,1427,229,519,548,724,164,252,442,524,532,705,794,884,889,903,1216,1518,1650,66,44,135,211,264,317,379,826,856,904,966,1081,1117,1301,1398,1607,1593,1409,1459,1510,183,400,403,587,693,748,804,805,817,938,987,1033,1225,1280,1368,1053,1379,267,1006,1132,7, 30, 53, 57, 75, 77, 96, 124, 128, 132, 138, 147, 161, 174, 176, 187, 191, 193, 204, 213, 216, 237, 241, 242, 258, 266, 269, 271, 277, 299, 309, 319, 328, 347, 351, 352, 357, 371, 388, 393, 411, 417, 425, 427, 429, 437, 450, 458, 500, 501, 506, 507, 525, 537, 554, 556, 568, 590, 600, 631, 638, 645, 646, 650, 671, 674, 675, 680, 685, 688, 694, 695, 698, 710, 716, 728, 730, 735, 736, 737, 738, 740, 744, 746, 763, 766, 779, 780, 790, 795, 831, 832, 836, 842, 846, 851, 860, 868, 891, 914, 920, 929, 931, 946, 950, 968, 971, 975, 976, 978, 983, 984, 985, 986, 993, 1001, 1010, 1013, 1022, 1030, 1038, 1039, 1042, 1045, 1051, 1055, 1086, 1106, 1109, 1113, 1135, 1141, 1148, 1159, 1169, 1173, 1191, 1195, 1197, 1200, 1211, 1212, 1221, 1230, 1233, 1237, 1244, 1250, 1253, 1261, 1262, 1270, 1287, 1293, 1296, 1311, 1321, 1325, 1326, 1336, 1341, 1346, 1350, 1367, 1370, 1380, 1406, 1408, 1417, 1435, 1436, 1437, 1439, 1441, 1445, 1453, 1457, 1462, 1465, 1466, 1469, 1480, 1488, 1490, 1493, 1496, 1500, 1506, 1514, 1516, 1522, 1523, 1529, 1530, 1541, 1556, 1557, 1558, 1559, 1561, 1565, 1568, 1577, 1578, 1585, 1588, 1594, 1600, 
1610, 1619, 1635]

In [147]:
spurs = df.drop(rows_to_drop)
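
The notebook does not show how `rows_to_drop` was compiled. For the duplicate part of the cleanup, one hedged approach is to flag rows whose captions are identical after light normalization and review them by hand. The `toy` frame and the normalization steps below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frame; rows 'a' and 'b' carry the same caption text
# apart from letter case and trailing whitespace.
toy = pd.DataFrame({
    "Video Id": ["a", "b", "c", "d"],
    "Video Title": ["Mourinho presser", "Mourinho presser (reupload)",
                    "Other", "Mourinho interview"],
    "Caption": ["we played well today", "We played well today ",
                "hello", "i am happy"],
})

# Normalize captions so trivial differences do not hide duplicates.
normalized = toy["Caption"].str.lower().str.strip()

# keep=False marks every member of a duplicate group, for manual review.
candidates = toy[normalized.duplicated(keep=False)]

# keep='first' marks only the later copies, which are then dropped.
deduped = toy[~normalized.duplicated(keep="first")]

print(candidates["Video Id"].tolist())
```

Exact matching misses re-uploads whose auto-generated captions differ slightly, so a manual pass like the hard-coded index list above may still be needed.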

In [148]:
spurs=spurs[spurs['Video Title'].str.contains("Mourinho")]

In [149]:
spurs = spurs.loc[spurs['ChannelTitle'] != 'News24']

In [150]:
spurs = spurs.loc[spurs['ChannelTitle'] != 'The United Stand']
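
The title and channel filters above can also be written as a single boolean mask. The sketch below uses a made-up `toy` frame; passing `na=False` to `str.contains` is an added precaution so that a missing title counts as "no match" instead of producing a missing value in the mask.

```python
import pandas as pd

# Hypothetical toy frame with the two columns used by the filters.
toy = pd.DataFrame({
    "Video Title": ["Jose Mourinho presser", "Match highlights",
                    None, "Mourinho reacts"],
    "ChannelTitle": ["Chan A", "Chan B", "Chan C", "News24"],
})

excluded_channels = ["News24", "The United Stand"]

# True where the title mentions Mourinho (missing titles count as False).
mentions_mourinho = toy["Video Title"].str.contains("Mourinho", na=False)
# True where the channel is not on the exclusion list.
allowed_channel = ~toy["ChannelTitle"].isin(excluded_channels)

kept = toy[mentions_mourinho & allowed_channel]
print(kept)
```

Keeping the excluded channels in one list makes it easier to add another channel later without writing a new cell.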

In [151]:
spurs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 917 entries, 0 to 1648
Data columns (total 13 columns):
Video Id              917 non-null object
ChannelTitle          917 non-null object
Language              211 non-null object
Description           917 non-null object
Published Datetime    917 non-null object
Tags                  808 non-null object
Video Title           917 non-null object
Comment #             908 non-null object
Dislike #             913 non-null object
Favorite #            917 non-null object
Like #                913 non-null object
View #                917 non-null object
Caption               917 non-null object
dtypes: object(13)
memory usage: 100.3+ KB


In [152]:
filename = 'Data/spurs.pkl'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(spurs, f)
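
As an alternative to calling `pickle` directly, pandas provides `DataFrame.to_pickle` and `pd.read_pickle`, which handle opening and closing the file. The sketch below round-trips a made-up `toy` frame through a temporary directory; the file name `spurs_toy.pkl` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for `spurs`.
toy = pd.DataFrame({"Video Id": ["a", "b"], "Caption": ["x", "y"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "spurs_toy.pkl")
    toy.to_pickle(path)              # write the frame to disk
    restored = pd.read_pickle(path)  # read it back

# The restored frame should match the original exactly.
print(restored.equals(toy))
```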