## Penultimate notebook, creating the building blocks of our final dataset!

In the previous notebooks we cleaned the Zindi training data, as well as the Kaggle dataset we built ourselves by downloading metadata through the Kaggle API. In this notebook we load those previously generated datasets and produce the following CSV files:

- FULL_kaggle_1s = pd.read_csv('FULL_kaggle_1s.csv'), containing all the health-related Kaggle samples

- kaggle_0s = pd.read_csv('kaggle_0s.csv'), containing all the non-health-related Kaggle samples

- zindi_1s = pd.read_csv('zindi_1s.csv'), containing all the Zindi samples (which are all health-related)

- kaggle_testing = pd.read_csv('kaggle_1s_testing.csv'), a smaller subset of health-related Kaggle samples for testing

Plan:

- load the Zindi training data, delete all indicator columns, and replace them with a single column of 1s
- load the size- and usability-filtered Kaggle data, add a column for each keyword, and populate it with 1 if the dataset has that keyword and 0 otherwise
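The keyword step of the plan could look roughly like the sketch below. It is only an illustration: the toy rows, the `keywords` column name, and the keyword list are all assumptions, since the real keyword list is not shown in this notebook.

```python
import re
import pandas as pd

# Toy stand-in for the filtered Kaggle metadata (hypothetical rows and column names).
toy_kaggle = pd.DataFrame({
    "title": ["heart disease uci", "berlin airbnb data", "malaria cell images"],
    "keywords": ["health, heart conditions", "business, hotels", "health, biology"],
})

# Hypothetical keyword list, used here only for illustration.
keyword_list = ["health", "biology", "business"]

for kw in keyword_list:
    # Match the keyword as a whole word; 1 if present in the row, 0 otherwise.
    pattern = r"\b" + re.escape(kw) + r"\b"
    toy_kaggle[kw] = toy_kaggle["keywords"].str.contains(pattern, regex=True).astype(int)

print(toy_kaggle[["title"] + keyword_list])
```

Each keyword ends up as its own 0/1 indicator column, which is the shape the plan describes.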


In [3]:
import pandas as pd
import numpy as np
import re


In [4]:
#loading the previously generated cleaned and/or filtered datasets
kaggle_df = pd.read_csv('clean_descriptions_kaggle_nonhealth_related.csv', encoding='utf-8')
zindi_training = pd.read_csv('train_CLEAN.csv', encoding='utf-8', engine='python')
kaggle_health_related_df = pd.read_csv('clean_descriptions_kaggle_health_related.csv')
FULL_kaggle_health_related = pd.read_csv('clean_description_FULLkagglemetadata_health_related.csv')

In [48]:
len(kaggle_health_related_df)
kaggle_health_related_df['description'][0]

'dataset consists 180k+ recipes 700k+ recipe reviews covering 18 years user interactions uploads foodcom formerly geniuskitchen used following paper generating personalized recipes historical user preferences bodhisattwa prasad majumder shuyang li jianmo ni julian mcauley emnlp 2019 https arxivorg pdf 190900105pdf'

In [49]:
kaggle_df['description'][0]

'canadians elect representatives house commons leader party confidence majority members house forms government data explores results elections past couple decades'

In [50]:
kaggle_df['description'].isna().value_counts()

False    1007
True       39
Name: description, dtype: int64

In [51]:
#positions of the rows whose description is NaN (isna() already returns booleans)
indices = np.where(kaggle_df['description'].isna())

In [52]:
list(indices[0])

[53,
 137,
 138,
 139,
 203,
 289,
 461,
 659,
 664,
 666,
 671,
 677,
 692,
 711,
 712,
 725,
 733,
 736,
 743,
 749,
 752,
 761,
 764,
 781,
 814,
 833,
 834,
 838,
 841,
 842,
 871,
 884,
 972,
 977,
 988,
 1009,
 1018,
 1027,
 1040]
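The same lookup can also be done with a boolean mask, which avoids unpacking the tuple that `np.where` returns. A minimal sketch on toy data (the rows are made up, not taken from `kaggle_df`):

```python
import numpy as np
import pandas as pd

# Toy frame with two missing descriptions (hypothetical rows).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "description": ["text", np.nan, "more text", None],
})

missing = toy["description"].isna()                      # boolean mask
positions = np.flatnonzero(missing.to_numpy()).tolist()  # positional indices
labels = toy.index[missing].tolist()                     # index labels
titles_of_missing = toy.loc[missing, "title"].tolist()   # titles for those rows

print(positions, labels, titles_of_missing)
```

With the default integer index, positions and labels coincide, which is why indexing `title` with the `np.where` result works in this notebook.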

In [53]:
#all the nan description rows do have titles! 
kaggle_df['title'][list(indices[0])]

53                         eod data dow jones stocks
137                            video object tracking
138                         news brazilian newspaper
139                         news brazilian newspaper
203                    thrinaxodon broomistega 3d ct
289                                  crime vancouver
461                         computer parts cpus gpus
659          embeddings glove crawl etc torch cached
664                               dogs gone sideways
666        airline delay cancellation data 2009 2018
671            air pollution paulo brazil since 2013
677          yearly tables big soccer leagues europe
692                                          lanl_ft
711                              oxford iiit dataset
712     eur usd forex pair historical data 2002 2019
725                               mnist preprocessed
733                        full keras pretrained top
736                             cifar10 preprocessed
743                      bay area bike sharing

In [54]:
kaggle_df.health_related.value_counts()

0.0    1046
Name: health_related, dtype: int64

In [55]:
zindi_training.Text

0       centers biomedical research excellence cobre p...
1       research regenerative medicine h2strongintrodu...
2       catholic health association india chai pthe ca...
3                quality improvement initiatives diabetes
4       provision thalassemia drugs disposables h2stro...
                              ...                        
2990              rats could help reduce global tb burden
2991    exploratory analyses adherence strategies data...
2992    study vaccines diarrhoeal diseases lower respi...
2993    regional engagement stimulation fund human imm...
2994    graphic design services consultancy pstrongobj...
Name: Text, Length: 2995, dtype: object

## the only information we really need is text and target 3
The following cells combine the title into the description text so that the title appears first.

In [56]:
string1 = kaggle_df.title[0]
string2 = kaggle_df.description[0]
string1 + " " + string2

'canadian federal election results timeseries canadians elect representatives house commons leader party confidence majority members house forms government data explores results elections past couple decades'

In [57]:
kaggle_df['description'][0]

'canadians elect representatives house commons leader party confidence majority members house forms government data explores results elections past couple decades'

In [58]:
kaggle_df.title[0] + " " + kaggle_df.description[0]

'canadian federal election results timeseries canadians elect representatives house commons leader party confidence majority members house forms government data explores results elections past couple decades'

In [63]:
kaggle_df['description'].fillna("")

0       canadians elect representatives house commons ...
1       related paper sichkar v n kolyubin effect vari...
2       airbnb successfully disrupted traditional hosp...
3       dataset contains 13 million sarcastic comments...
4       dataset used blog post https clorichelcom blog...
                              ...                        
1041    theres story behind every dataset heres opport...
1042    united state federal highway administration fh...
1043    works focuses upon creating data set pandas q ...
1044    dataset com fotos tiradas com cmera de um ipho...
1045    attempt figure books judged blurb cover wanted...
Name: description, Length: 1046, dtype: object

In [1]:
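#function to combine strings contained in two series objects
def combine_strings(str_series1, str_series2):
    """Join two string Series element-wise with a single space, treating NaN as an empty string."""
    #replace NaN with "" first, otherwise NaN + str would give NaN for the whole row
    str_series1 = str_series1.fillna("")
    str_series2 = str_series2.fillna("")
    new_series = str_series1 + " " + str_series2
    return new_series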
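To see why the `fillna("")` calls matter, here is a small sketch on toy data (the two rows are illustrative). It also shows an optional `.str.strip()` step, which is not part of the function above, to drop the trailing space left behind when a description is missing.

```python
import numpy as np
import pandas as pd

# Toy title/description pairs; the first description is missing.
titles = pd.Series(["crime vancouver", "berlin airbnb data"])
descriptions = pd.Series([np.nan, "airbnb successfully disrupted traditional hospitality"])

# Naive concatenation: a missing description turns the whole row into NaN.
naive = titles + " " + descriptions

# Filling NaN with "" first keeps the title for that row.
filled = titles.fillna("") + " " + descriptions.fillna("")

# Optional extra step (an assumption, not in the original function):
# remove the dangling space for rows that had no description.
stripped = filled.str.strip()

print(naive.isna().tolist())
print(repr(filled[0]), repr(stripped[0]))
```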

In [65]:
#create a small testing dataset from the previously collected health related kaggle datasets,
#which were filtered by usability and size
kaggle_health_related_df['Text'] = combine_strings(kaggle_health_related_df['title'], kaggle_health_related_df['description'])

In [67]:
kaggle_health_related_df['Text'][0]

'foodcom recipes interactions dataset consists 180k+ recipes 700k+ recipe reviews covering 18 years user interactions uploads foodcom formerly geniuskitchen used following paper generating personalized recipes historical user preferences bodhisattwa prasad majumder shuyang li jianmo ni julian mcauley emnlp 2019 https arxivorg pdf 190900105pdf'

In [68]:
#create the text column for the FULL kaggle health related metadata, 654 1s
FULL_kaggle_health_related['Text'] = combine_strings(FULL_kaggle_health_related['title'], FULL_kaggle_health_related['description'])

In [69]:
FULL_kaggle_health_related['Text'].isna().value_counts()

False    654
Name: Text, dtype: int64

In [70]:
FULL_kaggle_health_related['Text'][0]

'foodcom recipes interactions dataset consists 180k+ recipes 700k+ recipe reviews covering 18 years user interactions uploads foodcom formerly geniuskitchen used following paper generating personalized recipes historical user preferences bodhisattwa prasad majumder shuyang li jianmo ni julian mcauley emnlp 2019 https arxivorg pdf 190900105pdf'

In [71]:
#this is a new df of ALL the kaggle health related examples with no filter, 654 samples
FULL_kaggle_1s = FULL_kaggle_health_related[['Text','health_related']]

In [72]:
FULL_kaggle_1s['Text'][0]

'foodcom recipes interactions dataset consists 180k+ recipes 700k+ recipe reviews covering 18 years user interactions uploads foodcom formerly geniuskitchen used following paper generating personalized recipes historical user preferences bodhisattwa prasad majumder shuyang li jianmo ni julian mcauley emnlp 2019 https arxivorg pdf 190900105pdf'

In [73]:
#save to csv
FULL_kaggle_1s.to_csv('FULL_kaggle_1s.csv')
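A note on the saved index: by default `to_csv` writes the DataFrame index as an unnamed first column, so reloading the file with a plain `pd.read_csv` yields an extra `Unnamed: 0` column. A minimal sketch with toy data (the rows and the in-memory buffer are illustrative stand-ins for the real files) of two ways to avoid that:

```python
import io
import pandas as pd

# Toy stand-in for FULL_kaggle_1s (hypothetical rows).
toy = pd.DataFrame({
    "Text": ["foodcom recipes interactions", "heart disease uci"],
    "health_related": [1.0, 1.0],
})

# Default behaviour: the index is written as an unnamed first column.
csv_with_index = toy.to_csv()
reloaded_default = pd.read_csv(io.StringIO(csv_with_index))

# Option 1: tell read_csv that the first column is the index.
reloaded_index_col = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Option 2: do not write the index in the first place.
csv_without_index = toy.to_csv(index=False)
reloaded_no_index = pd.read_csv(io.StringIO(csv_without_index))

print(list(reloaded_default.columns))
print(list(reloaded_index_col.columns))
print(list(reloaded_no_index.columns))
```

Either option works; which one to pick depends on whether later notebooks expect the index column to be present in the file.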

In [77]:
#create the text column for the kaggle 0s
kaggle_df['Text'] = combine_strings(kaggle_df['title'], kaggle_df['description'])

In [78]:
kaggle_df['Text'][0]

'canadian federal election results timeseries canadians elect representatives house commons leader party confidence majority members house forms government data explores results elections past couple decades'

In [79]:
kaggle_df['Text'].isna().value_counts()

False    1046
Name: Text, dtype: int64

In [80]:
kaggle_df['Text'][0]

'canadian federal election results timeseries canadians elect representatives house commons leader party confidence majority members house forms government data explores results elections past couple decades'

In [81]:
kaggle_df['health_related']

0       0.0
1       0.0
2       0.0
3       0.0
4       0.0
       ... 
1041    0.0
1042    0.0
1043    0.0
1044    0.0
1045    0.0
Name: health_related, Length: 1046, dtype: float64

In [86]:
#this is testing set
kaggle_health_related_df['Text'] = combine_strings(kaggle_health_related_df['title'], kaggle_health_related_df['description'])

In [87]:
#this is the testing dataset
kaggle1s = kaggle_health_related_df[['Text','health_related']]

In [88]:
#this is testing set
kaggle1s['health_related'].value_counts()

1.0    158
Name: health_related, dtype: int64

In [89]:
#save as the smaller testing set
kaggle1s.to_csv('kaggle_1s_testing.csv')

In [90]:
#this goes into the training data, kaggle 0s that came from the filtered size/usability kaggle data
kaggle_0s = kaggle_df[['Text','health_related']]

In [91]:
kaggle_0s.head(10)

| | Text | health_related |
|---|---|---|
| 0 | canadian federal election results timeseries c... | 0.0 |
| 1 | traffic signs preprocessed related paper sichk... | 0.0 |
| 2 | berlin airbnb data airbnb successfully disrupt... | 0.0 |
| 3 | sarcastic comments reddit dataset contains 13 ... | 0.0 |
| 4 | boat types recognition dataset used blog post ... | 0.0 |
| 5 | mlb pitch data 20152018 pitchlevel data every ... | 0.0 |
| 6 | consumer visitor insights neighborhoods safegr... | 0.0 |
| 7 | cvpr 2019 papers cvpr http cvpr2019thecvfcom p... | 0.0 |
| 8 | department justice 20092018 press releases his... | 0.0 |
| 9 | department justice 20092018 press releases his... | 0.0 |


In [92]:
kaggle_0s.to_csv('kaggle_0s.csv')

In [93]:
zindi_1s = zindi_training[['Text']].copy()

In [94]:
#every zindi sample is health related, so the label column is all 1.0
#(working on an explicit copy avoids pandas' SettingWithCopyWarning)
zindi_1s['health_related'] = 1.0


In [95]:
zindi_1s.head(10)

| | Text | health_related |
|---|---|---|
| 0 | centers biomedical research excellence cobre p... | 1.0 |
| 1 | research regenerative medicine h2strongintrodu... | 1.0 |
| 2 | catholic health association india chai pthe ca... | 1.0 |
| 3 | quality improvement initiatives diabetes | 1.0 |
| 4 | provision thalassemia drugs disposables h2stro... | 1.0 |
| 5 | egypt country programme family planning 201820... | 1.0 |
| 6 | improving quantification forecasting new drugs... | 1.0 |
| 7 | call metrology emerging radiopharmaceuticals d... | 1.0 |
| 8 | funding stimulate clinical translational multi... | 1.0 |
| 9 | procurement radiopharmaceuticals treatment can... | 1.0 |


In [96]:
zindi_1s.to_csv('zindi_1s.csv')

In [34]:
health_related_training = pd.concat([kaggle_0s, zindi_1s], axis=0)
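One thing to keep in mind with `pd.concat` along `axis=0`: each frame keeps its own index, so the combined frame has repeated index labels (both inputs start at 0). A minimal sketch with toy frames (hypothetical rows) showing the difference that `ignore_index=True` makes:

```python
import pandas as pd

# Toy stand-ins for kaggle_0s and zindi_1s (hypothetical rows).
zeros = pd.DataFrame({
    "Text": ["election results", "airbnb data"],
    "health_related": [0.0, 0.0],
})
ones = pd.DataFrame({
    "Text": ["tb burden", "vaccines study", "diabetes care"],
    "health_related": [1.0, 1.0, 1.0],
})

# Plain stacking keeps the original labels, so 0 and 1 appear twice.
stacked = pd.concat([zeros, ones], axis=0)

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([zeros, ones], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(renumbered["health_related"].value_counts().to_dict())
```

Duplicate labels are harmless for saving to CSV, but label-based lookups such as `df.loc[0]` then return more than one row, so renumbering can be the safer choice before modelling.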

In [35]:
health_related_training

| | Text | health_related |
|---|---|---|
| 0 | Canadian Federal Election Results (Timeseries)... | 0.0 |
| 1 | Traffic Signs Preprocessed ### Related Paper\n... | 0.0 |
| 2 | Berlin Airbnb Data ### Context\n\nAirbnb has s... | 0.0 |
| 3 | Sarcastic Comments - REDDIT ***Context***\n\nT... | 0.0 |
| 4 | Boat types recognition ### Context\n\nThis dat... | 0.0 |
| ... | ... | ... |
| 2990 | rats could help reduce global tb burden | 1.0 |
| 2991 | exploratory analyses adherence strategies data... | 1.0 |
| 2992 | study vaccines diarrhoeal diseases lower respi... | 1.0 |
| 2993 | regional engagement stimulation fund human imm... | 1.0 |


In [122]:
#health_related_training.to_csv('first_combined_kagglezindi_training.csv', encoding='utf-8')