## Cleaning and aggregating Kaggle metadata for training

In the previous notebooks, we downloaded metadata about Kaggle's datasets and also cleaned and prepared training data from the Zindi competition related to SDG #3 (health).

For my final project, I wanted to be able to label SDG related datasets. For the scope of the demo I focused on labeling just SDG #3 related datasets, since this was the only SDG I found specific training data for.

This notebook filters the Kaggle datasets whose metadata I previously downloaded and labels each one as SDG #3 health related (1) or not (0). We first filter on dataset size and usability rating (a statistic provided by Kaggle). Later we decide which datasets are health related using a set of about 20 common health related keywords.

All datasets created in this notebook come from the downloaded kaggle_datasetlists and kaggle_metadatasets folders, which are included in the repo. Several intermediary datasets are created and saved along the way, to make clear which dataset is currently being worked on. Downloading the notebook and running the cells will create any of the intermediary datasets.

Finally: This notebook is difficult to follow without knowing the full process/purpose of my project! You can hear why I have created so many different datasets here https://www.youtube.com/watch?v=pkMBp_l6Tgo&feature=youtu.be&t=758

Otherwise, enjoy the ride and hopefully there are a few techniques with pandas, dictionaries, and sets for filtering dataframes that someone finds useful!


## this notebook produces:
clean_FULL_kagglemetadata_health_related.to_csv('clean_description_FULLkagglemetadata_health_related.csv')
clean_kaggle_nonhealth_related.to_csv('clean_descriptions_kaggle_nonhealth_related.csv')
clean_kaggle_health_related.to_csv('clean_descriptions_kaggle_health_related.csv')

For the demo
data_to_display_1024 = pd.read_csv('data_to_display_filter.csv')

## loading the dataset list and metadata

Recall that when downloading from the Kaggle API we kept track of bad pages. The bad pages found are noted here.

pages 1-199 had bad_pages = [2, 3, 4, 5, 6, 7, 8, 120, 174, 194]

pages 200-399 and pages 400-500 had bad_pages = []

page 501 returned "No datasets found"

In [1]:
#for vectorized operations
import numpy as np
#for dataset visualization and aggregation/organization
import pandas as pd
#for data cleaning and removing the html tags
import re


In [2]:
#read the dataset list CSVs
dataset_list_1_199 = pd.read_csv('kaggle_datasetlists/pages_1_199_datasetlist.csv')
dataset_list_200_399 = pd.read_csv('kaggle_datasetlists/pages_200_399_datasetlist.csv')
dataset_list_400_500 = pd.read_csv('kaggle_datasetlists/pages_400_500_datasetlist.csv')

In [3]:
#info displayed in the columns
print(dataset_list_1_199.columns, dataset_list_1_199.shape)
print(dataset_list_200_399.columns, dataset_list_200_399.shape)
print(dataset_list_400_500.columns, dataset_list_400_500.shape)

Index(['Unnamed: 0', 'ref', 'title', 'size', 'lastUpdated', 'downloadCount',
       'voteCount', 'usabilityRating'],
      dtype='object') (3780, 8)
Index(['Unnamed: 0', 'ref', 'title', 'size', 'lastUpdated', 'downloadCount',
       'voteCount', 'usabilityRating'],
      dtype='object') (4000, 8)
Index(['Unnamed: 0', 'ref', 'title', 'size', 'lastUpdated', 'downloadCount',
       'voteCount', 'usabilityRating'],
      dtype='object') (2020, 8)


In [4]:
dataset_list_1_199.head(2)

Unnamed: 0.1,Unnamed: 0,ref,title,size,lastUpdated,downloadCount,voteCount,usabilityRating
0,0,gustavomodelli/forest-fires-in-brazil,Forest Fires in Brazil,31KB,2019-08-24 16:09:16,18082,449,0.7647059
1,1,rajeevw/ufcdata,UFC-Fight historical data from 1993 to 2019,3MB,2019-07-05 09:58:02,12754,485,0.9705882


In [5]:
#read the metadata csvs
metadata_1_199 = pd.read_csv('kaggle_metadatasets/pages_1_199_metadata.csv')
metadata_200_399 = pd.read_csv('kaggle_metadatasets/pages_200_399_metadata.csv')
metadata_400_500 = pd.read_csv('kaggle_metadatasets/pages_400_500_metadata.csv')

In [6]:
#transpose to give the correct orientation (same as datasetlists)
metadata_1_199 = metadata_1_199.copy().T
metadata_200_399 = metadata_200_399.copy().T
metadata_400_500 = metadata_400_500.copy().T

In [7]:
print(metadata_1_199.columns, metadata_1_199.shape)
print(metadata_200_399.columns, metadata_200_399.shape)
print(metadata_400_500.columns, metadata_400_500.shape)

RangeIndex(start=0, stop=17, step=1) (3781, 17)
RangeIndex(start=0, stop=17, step=1) (4001, 17)
RangeIndex(start=0, stop=17, step=1) (2021, 17)


In [8]:
#the names of the columns
#we define them here and use them to set the column names in the next block
cols = metadata_1_199.iloc[0,:]

In [9]:
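#note: head is written without parentheses here, so the output below is the repr of the
#bound method (the whole frame); metadata_1_199.head() would show only the first five rows
metadata_1_199.head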

<bound method NDFrame.head of                                                             0       1  \
Unnamed: 0                                                 id   id_no   
0                       gustavomodelli/forest-fires-in-brazil  316056   
0.1                                           rajeevw/ufcdata  255092   
0.2             tristan581/17k-apple-app-store-strategy-games  318093   
0.3         chirin/africa-economic-banking-and-systemic-cr...  271144   
...                                                       ...     ...   
0.3775                   nodoubttome/skin-cancer9-classesisic  319080   
0.3776                         sivaram1987/google-stock-price  202531   
0.3777      suchith0312/pickledglove300d22mforkernelcompet...  192835   
0.3778      jjacostupa/condition-monitoring-of-hydraulic-s...  317375   
0.3779                                     tensor2flow/result  267184   

                    2                                                 3  \
Unnamed: 0  datase

In [10]:
#set the columns
metadata_1_199.columns = cols
metadata_200_399.columns = cols
metadata_400_500.columns = cols


In [11]:
#remove the redundant 0th row which repeats the columns
metadata_1_199 = metadata_1_199.iloc[1:]
metadata_200_399 = metadata_200_399.iloc[1:]
metadata_400_500 = metadata_400_500.iloc[1:]
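
The transpose, set-columns, and drop-first-row steps above can be sketched on a toy frame. The values below are hypothetical stand-ins for one raw metadata CSV, and the `reset_index` call is an optional extra that the cells above do not perform.

```python
import pandas as pd

# Toy stand-in for one raw metadata CSV: each column is one dataset,
# and the first column holds the field names (hypothetical values).
raw = pd.DataFrame({
    "Unnamed: 0": ["id", "title"],
    "0": ["owner/a", "Dataset A"],
    "0.1": ["owner/b", "Dataset B"],
})

# Transpose so that each dataset becomes a row.
flipped = raw.T

# Promote the first row to column names, then drop it from the data.
flipped.columns = flipped.iloc[0]
tidy = flipped.iloc[1:].reset_index(drop=True)

# Remove the leftover 'Unnamed: 0' label on the columns axis.
tidy.columns.name = None

print(tidy)
```

Resetting the index replaces the `0`, `0.1`, `0.2` labels inherited from the CSV header with plain row numbers, which may make later positional lookups easier to reason about.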

In [12]:
metadata_200_399['data'][0]

"[{'description': None, 'name': 'Sin Wave Data Generator.csv', 'totalBytes': 66719, 'columns': [{'name': 'Wave', 'description': None, 'type': 'Uuid'}]}, {'description': None, 'name': 'Sin Wave Data Generator.xlsx', 'totalBytes': 293834, 'columns': []}]"

In [13]:
len(metadata_400_500['description'])

2020

In [14]:
#PREPARE FOR THE CONCATENATION
metadatasets = [metadata_1_199, metadata_200_399, metadata_400_500]
datasetlists = [dataset_list_1_199, dataset_list_200_399, dataset_list_400_500]

In [15]:
#concat vertically 
metadata = pd.concat(metadatasets, axis=0, sort=True)
datasetlist = pd.concat(datasetlists, axis=0, sort=True)

In [16]:
len(metadata)

9800
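
Because each metadata chunk carries its own `0`, `0.1`, `0.2`, ... index labels, the concatenated frame ends up with repeated labels. A minimal sketch of the difference `ignore_index=True` makes, using two hypothetical toy chunks (this option is not used in the cells above):

```python
import pandas as pd

# Two toy chunks that reuse the same index labels, like the metadata chunks do.
part_a = pd.DataFrame({"title": ["A", "B"]}, index=["0", "0.1"])
part_b = pd.DataFrame({"title": ["C", "D"]}, index=["0", "0.1"])

# Plain vertical concat keeps the original labels, so they repeat.
stacked = pd.concat([part_a, part_b], axis=0)

# ignore_index=True discards the labels and numbers the rows 0..n-1.
renumbered = pd.concat([part_a, part_b], axis=0, ignore_index=True)

print(stacked.index.is_unique, renumbered.index.is_unique)
```

A unique index makes label-based lookups such as `.loc` return exactly one row.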

In [128]:
metadata.head(5)

Unnamed: 0,collaborators,data,datasetId,datasetSlug,description,id,id_no,isPrivate,keywords,licenses,ownerUser,subtitle,title,totalDownloads,totalViews,totalVotes,usabilityRating
0.0,[],[{'description': 'This dataset report of the n...,316056,forest-fires-in-brazil,### Context\n\nForest fires are a serious prob...,gustavomodelli/forest-fires-in-brazil,316056,False,"['business', 'sensitive subjects', 'agricultur...",[{'name': 'copyright-authors'}],gustavomodelli,Number of forest fires reported in Brazil by S...,Forest Fires in Brazil,18103,123100,449,0.7647058823529411
0.1,[],[{'description': 'This is the partially proces...,255092,ufcdata,### Context\n\nThis is a list of every UFC fig...,rajeevw/ufcdata,255092,False,"['games and toys', 'sports', 'martial arts', '...",[{'name': 'CC0-1.0'}],rajeevw,"Compiled UFC fight, fighter stats and informat...",UFC-Fight historical data from 1993 to 2019,12759,83362,485,0.9705882352941176
0.2,[],[{'description': 'This csv file has 16 variabl...,318093,17k-apple-app-store-strategy-games,# Overview\nThe mobile games industry is worth...,tristan581/17k-apple-app-store-strategy-games,318093,False,"['video games', 'computing', 'internet', 'mobi...",[{'name': 'Attribution 4.0 International (CC B...,tristan581,Every strategy game on the Apple App Store,17K Mobile Strategy Games,14183,108452,514,0.9411764705882352
0.3,[],"[{'description': ""This dataset is a derivative...",271144,africa-economic-banking-and-systemic-crisis-data,### Context\n\nThis dataset is a derivative of...,chirin/africa-economic-banking-and-systemic-cr...,271144,False,"['africa', 'history', 'business', 'finance', '...",[{'name': 'copyright-authors'}],chirin,Data on Economic and Financial crises in 13 Af...,"Africa Economic, Banking and Systemic Crisis Data",6529,42038,207,1.0
0.4,[],"[{'description': 'There are 349,000 rows with ...",311454,border-crossing-entry-data,### Context\n\nThe Bureau of Transportation St...,akhilv11/border-crossing-entry-data,311454,False,"['statistics', 'natural and physical sciences'...",[{'name': 'U.S. Government Works'}],akhilv11,Inbound US border crossing entries,Border Crossing Entry Data,7201,50775,185,0.8235294117647058


In [129]:
#there are some null values initially in the metadata
metadata['description'].isna().value_counts()

True     5924
False    3876
Name: description, dtype: int64

In [130]:
len(metadata['title'])

9800

In [131]:
#information about the actual dataset, column headers
metadata['data'][0]

"[{'description': 'This dataset report of the number of forest fires in Brazil divided by states. The series comprises the period of approximately 10 years (1998 to 2017). ', 'name': 'amazon.csv', 'totalBytes': 260933, 'columns': [{'name': 'year', 'description': 'Year when Forest Fires happen', 'type': 'Uuid'}, {'name': 'state', 'description': 'Brazilian State ', 'type': 'String'}, {'name': 'month', 'description': 'Month when Forest Fires happen', 'type': 'String'}, {'name': 'number', 'description': 'Number of Forest Fires reported', 'type': 'Uuid'}, {'name': 'date', 'description': 'Date when Forest Fires where reported', 'type': 'DateTime'}]}]"

In [20]:
datasetlist.head(5)

Unnamed: 0.1,Unnamed: 0,downloadCount,lastUpdated,ref,size,title,usabilityRating,voteCount
0,0,18082,2019-08-24 16:09:16,gustavomodelli/forest-fires-in-brazil,31KB,Forest Fires in Brazil,0.7647059,449
1,1,12754,2019-07-05 09:58:02,rajeevw/ufcdata,3MB,UFC-Fight historical data from 1993 to 2019,0.9705882,485
2,2,14177,2019-08-26 08:22:16,tristan581/17k-apple-app-store-strategy-games,8MB,17K Mobile Strategy Games,0.9411765,514
3,3,6521,2019-07-21 02:00:17,chirin/africa-economic-banking-and-systemic-cr...,14KB,"Africa Economic, Banking and Systemic Crisis Data",1.0,207
4,4,7198,2019-08-21 14:51:34,akhilv11/border-crossing-entry-data,4MB,Border Crossing Entry Data,0.8235294,185


## Filtering usable datasets
Now that we have concatenated all of our dataset information into two dataframes (one containing all the dataset lists, the other containing all the metadata), we want to filter on size and usability rating so that the final tagged datasets are actually ML-friendly.

## usability rating >= 0.6 and size > 10MB

We want to keep the rows of the metadata where usability rating >= 0.6 and size > 10MB.

## create a new column for size in the metadataset, copied from datasetlist

In [138]:
#make copies of metadata so we can freely alter columns
filter_metadata = metadata.copy()

#data_to_display is specifically created for the demo where we display the datasets with
#string sizekb
data_to_display = metadata.copy()

In [139]:
#create the size column in filter_metadata
filter_metadata['size'] = datasetlist['size'].values

In [141]:
#for data to display, we want to be able to filter the size but still keep the string KB
#information for display
data_to_display['size_kb'] = datasetlist['size'].values
data_to_display['size'] = datasetlist['size'].values
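
The cells above copy the size column by row position, which relies on both frames listing the datasets in the same order. A hedged alternative is to join on the dataset identifier instead. The sketch below assumes that `ref` in the dataset list and `id` in the metadata hold the same `owner/slug` string (as the displayed rows suggest). The toy frames are hypothetical.

```python
import pandas as pd

# Toy metadata and dataset list; the list is deliberately in a different order.
metadata_toy = pd.DataFrame({
    "id": ["owner/a", "owner/b", "owner/c"],
    "title": ["A", "B", "C"],
})
datasetlist_toy = pd.DataFrame({
    "ref": ["owner/c", "owner/a", "owner/b"],
    "size": ["2GB", "31KB", "3MB"],
})

# Left join keeps every metadata row and looks up its size by identifier.
merged = metadata_toy.merge(
    datasetlist_toy[["ref", "size"]],
    left_on="id",
    right_on="ref",
    how="left",
).drop(columns="ref")

print(merged)
```

If the same `ref` appears more than once in the dataset list, a join like this produces one output row per match, so dropping duplicate refs first (for example with `drop_duplicates("ref")`) would be a sensible precaution.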

In [136]:
#filter_metadata now has size kb
filter_metadata.columns

Index(['collaborators', 'data', 'datasetId', 'datasetSlug', 'description',
       'id', 'id_no', 'isPrivate', 'keywords', 'licenses', 'ownerUser',
       'subtitle', 'title', 'totalDownloads', 'totalViews', 'totalVotes',
       'usabilityRating', 'size_kb'],
      dtype='object', name='Unnamed: 0')

In [142]:
#data_to_display now has size_kb, which will be left intact, and size, which will be
#converted and used to filter
data_to_display.columns

Index(['collaborators', 'data', 'datasetId', 'datasetSlug', 'description',
       'id', 'id_no', 'isPrivate', 'keywords', 'licenses', 'ownerUser',
       'subtitle', 'title', 'totalDownloads', 'totalViews', 'totalVotes',
       'usabilityRating', 'size_kb', 'size'],
      dtype='object', name='Unnamed: 0')

In [24]:
#back to filter_metadata, check the new size column is the same as the column copied from datasetlist
filter_metadata['size'].value_counts()

2MB      357
1MB      262
3MB      235
2GB      199
4MB      183
        ... 
902MB      1
842B       1
875KB      1
494MB      1
198KB      1
Name: size, Length: 1676, dtype: int64

In [25]:
datasetlist['size'].value_counts()

2MB      357
1MB      262
3MB      235
2GB      199
4MB      183
        ... 
902MB      1
842B       1
875KB      1
494MB      1
198KB      1
Name: size, Length: 1676, dtype: int64

## transform string sizes into numerical kilobytes
To be able to filter on size, we have to convert the size strings (e.g. '31KB', '3MB', '2GB') into a single unit (kilobytes), store them as numbers, and filter on the values in that column.

In [26]:
#testing string cases
string = '10KB'
string2 = '1GB'

In [27]:
#testing grabbing the value of interest out of the string
int(string2.split(string2[-2])[0])

1

In [28]:
#helper function that takes a size string such as '31KB', '3MB' or '2GB' and converts it to kilobytes

def convert_to_KB(string):
    #multiplier for each two-letter unit suffix, expressed in KB
    units = {'KB': 1, 'MB': 1024, 'GB': 1048576}
    suffix = string[-2:]
    if suffix in units:
        #the number is everything before the two-letter suffix
        return units[suffix] * int(string[:-2])
    elif string.endswith('B'):
        #plain byte sizes such as '842B' become a fraction of a KB
        return int(string[:-1]) / 1024
    #unrecognized format
    return None

In [29]:
#array created to test convert_to_KB helper function
test_array = ['31KB', '3MB', '10MB']

In [30]:
KB_array = [convert_to_KB(string) for string in test_array]

In [31]:
KB_array

[31, 3072, 10240]
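
The same conversion can also be done on a whole column at once. This is a minimal sketch, not the method used in this notebook. It assumes every size string is a whole number followed by `B`, `KB`, `MB` or `GB`, and anything else becomes NaN. The names `sizes`, `parts` and `kb_per_unit` are illustrative.

```python
import pandas as pd

# Toy size strings in the format Kaggle reports them.
sizes = pd.Series(["31KB", "3MB", "2GB", "842B"])

# Split each string into its numeric part and its unit suffix.
parts = sizes.str.extract(r"^(?P<number>\d+)(?P<unit>B|KB|MB|GB)$")

# Kilobytes per unit; bytes become a fraction of a kilobyte.
kb_per_unit = {"B": 1 / 1024, "KB": 1, "MB": 1024, "GB": 1024 ** 2}

# Multiply the number by the unit's multiplier, row by row.
size_kb = parts["number"].astype(float) * parts["unit"].map(kb_per_unit)

print(size_kb)
```

Strings that do not match the pattern come out as NaN, and NaN never satisfies a `>` comparison, so such rows drop out of the size filter rather than raising an error.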

In [32]:
#now that we know the helper function works correctly, we can work it on the columns
filter_metadata['size'] = [convert_to_KB(row) for row in filter_metadata['size']]

In [143]:
#we do the same for the data_to_display dataframe, converting its own string size column
data_to_display['size'] = [convert_to_KB(row) for row in data_to_display['size']]

In [144]:
data_to_display['size']

0           31.0
0.1       3072.0
0.2       8192.0
0.3         14.0
0.4       4096.0
           ...  
0.2015    3072.0
0.2016    2048.0
0.2017    3072.0
0.2018    3072.0
0.2019    3072.0
Name: size, Length: 9800, dtype: float64

In [33]:
filter_metadata['size']

0           31.0
0.1       3072.0
0.2       8192.0
0.3         14.0
0.4       4096.0
           ...  
0.2015    3072.0
0.2016    2048.0
0.2017    3072.0
0.2018    3072.0
0.2019    3072.0
Name: size, Length: 9800, dtype: float64

## change usability rating from string to float

In [152]:
#testing one row
float(filter_metadata['usabilityRating'][0])

0.7647058823529411

In [35]:
#apply to the whole usability rating column
filter_metadata['usabilityRating'] = filter_metadata['usabilityRating'].astype(float)

In [153]:
#the same is done for data_to_display
data_to_display['usabilityRating'] = data_to_display['usabilityRating'].astype(float)

In [36]:
#check result
filter_metadata['usabilityRating']

0         0.764706
0.1       0.970588
0.2       0.941176
0.3       1.000000
0.4       0.823529
            ...   
0.2015    0.411765
0.2016    0.176471
0.2017    0.235294
0.2018    0.235294
0.2019    0.235294
Name: usabilityRating, Length: 9800, dtype: float64
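
If some ratings were missing or malformed, a plain `float()` call would raise an error. A minimal sketch of a more forgiving conversion with `pd.to_numeric`, using a hypothetical toy column (the real column may contain no bad values at all):

```python
import pandas as pd

# Toy ratings stored as strings, with one missing and one malformed entry.
ratings = pd.Series(["0.7647058823529411", "1.0", None, "n/a"])

# errors="coerce" turns anything that cannot be parsed into NaN.
ratings_float = pd.to_numeric(ratings, errors="coerce")

print(ratings_float)
```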

## Actually applying the filter!
Now that our column types are numeric, we're able to apply our filter of usability rating >= 0.6 and size > 10MB (10240 KB).

In [154]:
#applying the filter to data_to_display, storing the result in data_to_display_filter
data_to_display_filter = data_to_display[
    (data_to_display['usabilityRating'] >= .6) & (data_to_display['size'] > 10240)]

In [155]:
#check that a filter was applied
len(data_to_display_filter['size_kb'])

1204

In [156]:
#data_to_display_filter is now ready to go to CSV
data_to_display_filter.to_csv('data_to_display_filter.csv')
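
The filter combines two boolean masks with `&`. A small sketch on a hypothetical toy frame shows how the thresholds behave, including for a row whose size could not be parsed. The constant names `MIN_RATING` and `MIN_SIZE_KB` are illustrative.

```python
import pandas as pd

# Toy frame with the two numeric columns used by the filter.
toy = pd.DataFrame({
    "title": ["small", "big but unusable", "big and usable", "unknown size"],
    "usabilityRating": [0.9, 0.3, 0.8, 0.7],
    "size": [512.0, 20480.0, 20480.0, float("nan")],
})

MIN_RATING = 0.6          # same threshold as the cells above
MIN_SIZE_KB = 10 * 1024   # 10MB expressed in KB

# Each comparison yields a boolean Series; the parentheses are required
# because & binds more tightly than >= and >.
mask = (toy["usabilityRating"] >= MIN_RATING) & (toy["size"] > MIN_SIZE_KB)
kept = toy[mask]

print(kept)
```

Only the row that passes both conditions survives. The row with a NaN size is dropped because comparisons against NaN evaluate to False.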

In [151]:
#apply to filter_metadata
#(usabilityRating must already be float here; if it is still a string,
#the >= comparison raises a TypeError between 'str' and 'float')
apply_filter_metadata = filter_metadata[
    (filter_metadata['usabilityRating'] >= .6) & (filter_metadata['size'] > 10240)]

## Assessment of what has been done so far
data_to_display was needed specifically for the demo, and has been filtered and stored in csv for that purpose.
For the actual machine learning and NLP, the rest of the notebook will be working with apply_filter_metadata. We want to apply NLP to the description and title texts to determine if a sample is health related or not. The next steps include:

1. determining if there are null values in the apply_filter_metadata in the description column
2. using the keywords column to determine if any Kaggle set is health related (1) or not (0)

In [38]:
#check for null values in the description column
apply_filter_metadata['description'].isna().value_counts()

False    1193
True       11
Name: description, dtype: int64

In [39]:
#store the positions of the rows where the description is null
indices = np.where(apply_filter_metadata['description'].isna())

In [40]:
indices

(array([ 765,  785,  812,  859,  961,  994, 1009, 1115, 1136, 1179, 1196]),)

In [41]:
#check the title of the rows lacking description text
apply_filter_metadata['title'].iloc[indices[0]]

0.2580    All yearly tables for big soccer leagues in Eu...
0.2861                                              LANL_FT
0.3110                                  Oxford IIIT dataset
0.3547                                       H1B Prediction
0.851                           Polish politicians speeches
0.1679                  Topic TSV for Freebase Common Dump 
0.1871    Last Week Tonight Zebra Video, Images, and Models
0.317     Last Week Tonight Zebra Video, Images, and Models
0.654                                        Wiki_portugues
0.1581    Last Week Tonight Zebra Video, Images, and Models
0.1895         Political conflicts in Africa from 1997-2018
Name: title, dtype: object

In [42]:
len(apply_filter_metadata)

1204

In [43]:
#Since all samples have some text to draw on (rows lacking description did have a text title)
#we can save to csv
apply_filter_metadata.to_csv('applied_filter_metadata.csv', encoding='utf-8')
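
Since rows without a description still have a title, one way to guarantee every sample has text is to combine the two fields. This is only a sketch of that idea, and the `text` column name is hypothetical rather than something this notebook creates.

```python
import pandas as pd

# Toy rows: one with a missing description, one with both fields.
toy = pd.DataFrame({
    "title": ["LANL_FT", "Forest Fires in Brazil"],
    "description": [None, "Forest fires are a serious problem"],
})

# Replace missing values with empty strings, join with a space,
# and strip the trailing space left when the description is empty.
toy["text"] = (
    toy["title"].fillna("") + " " + toy["description"].fillna("")
).str.strip()

print(toy["text"].tolist())
```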

## Determining the most common keywords

We are going to use each sample's keywords as the main method for determining whether a dataset is health related or not.

In [157]:
#read the dataframes in fresh
applied_filter_metadata = pd.read_csv('applied_filter_metadata.csv', encoding='utf-8')
data_to_display_1024 = pd.read_csv('data_to_display_filter.csv')

In [45]:
#take the values of the keywords column, results in an array
keywords_array = applied_filter_metadata['keywords'].values

#keywords_array is an array of strings; each string is the printed form of a Python list
#of keywords separated by commas
keywords_array

array(["['health', 'computing', 'internet', 'cooking and recipes', 'food', 'online communities', 'social networks']",
       "['earth sciences', 'weather', 'health']",
       "['north america', 'time series', 'government', 'politics']", ...,
       "['categorical data', 'computing', 'data cleaning', 'text data', 'binary classification']",
       "['brazil', 'money', 'image data', 'online image galleries', 'photo and video services', 'online communities']",
       "['literature', 'nlp', 'tabular data', 'image data']"],
      dtype=object)

In [158]:
#helper functions to clean the key_word array

def keyword_cleaner(keywords):
    cleantext = keywords.replace('[', '').replace(']', '').replace('\'', '').replace(", ", ',')
    return cleantext

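def cleanjson(weird_json):
    #flatten newlines and drop markdown heading markers
    less_weird_json = weird_json.replace('\n', ' ').replace('#', '')
    #match html tags and html entities such as &amp; (numeric entities arrive as &39;
    #because the '#' was already stripped above)
    cleaner = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    cleantext = re.sub(cleaner, '', less_weird_json)
    return cleantext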



In [159]:
#test on one row
keyword_cleaner(keywords_array[0])

'health,computing,internet,cooking and recipes,food,online communities,social networks'
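
Because each keywords entry is the printed form of a Python list, it can also be parsed back into a real list with the standard library instead of stripping characters by hand. This is an alternative sketch, not what `keyword_cleaner` does.

```python
import ast

# One raw keywords string, in the same format as the keywords column.
raw_keywords = "['health', 'computing', 'cooking and recipes']"

# literal_eval safely evaluates Python literals (lists, strings, numbers)
# without executing arbitrary code.
parsed = ast.literal_eval(raw_keywords)

print(parsed)
```

Parsing this way keeps each keyword as a separate list element, so a keyword that itself contains a comma or an apostrophe would stay intact.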

In [160]:
#apply the cleaner to the entire keywords array, store in clean_keywords_array
clean_keywords_array = [keyword_cleaner(string) for string in keywords_array]

In [161]:
#clean_keywords_array is an array where each element is a string of keywords
clean_keywords_array

['health,computing,internet,cooking and recipes,food,online communities,social networks',
 'earth sciences,weather,health',
 'north america,time series,government,politics',
 'law,computer science,classification,cnn,image data,multiclass classification,pattern recognition,business operations,vehicle codes and driving laws',
 'europe,leisure,hotels,hotels and accommodations,vacation rentals and short-term stays',
 'humor,literature,psychology,reddit,internet,funny pictures and videos,reference',
 'photography,boating,classification,image data,multiclass classification,classifieds,boats and watercraft',
 'baseball,sports,sports news',
 'business,demographics,computing,road transport,reference,business services',
 'academics,natural and physical sciences,education,computing,computer science,text data',
 'natural and physical sciences,crime,politics,nlp,text data,law and government',
 'natural and physical sciences,crime,politics,nlp,text data,law and government',
 'business,internet,mobil

In [162]:
#testing the split, applying it like this will turn each string element of the 
#clean_keywords_array into a list where each element is one string keyword
clean_keywords_array[0].split(",")

['health',
 'computing',
 'internet',
 'cooking and recipes',
 'food',
 'online communities',
 'social networks']

In [163]:
#define an empty dictionary
keyword_count = {}

#now we are going to cycle through clean_keywords_array and fill the dictionary keyword_count,
#with keys being each keyword and values being the number of times that keyword appeared
for string in clean_keywords_array:
    keyword_tokens = string.split(",")
    for word in keyword_tokens:
        if word in keyword_count:
            keyword_count[word] += 1
        else:
            keyword_count[word] = 1
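
The counting loop above can also be written with `collections.Counter` from the standard library. A minimal sketch on a hypothetical three-row sample:

```python
from collections import Counter

# Toy version of clean_keywords_array: one comma-separated string per dataset.
clean_keywords = [
    "health,computing,internet",
    "earth sciences,weather,health",
    "computing",
]

# Split every row into keywords and count each occurrence.
keyword_counter = Counter(
    word for row in clean_keywords for word in row.split(",")
)

# most_common(n) returns the n highest counts as (keyword, count) pairs.
top_two = keyword_counter.most_common(2)

print(top_two)
```

`most_common()` with no argument returns every keyword sorted by count, which covers the sorting step as well.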

In [164]:
#to inspect the results, we want to sort the dictionary by the keyword counts (values)
import operator
sorted_keyword_count = sorted(keyword_count.items(),key=operator.itemgetter(1),reverse=True)
sorted_keyword_count

[('business', 209),
 ('natural and physical sciences', 209),
 ('computer science', 195),
 ('image data', 193),
 ('computing', 190),
 ('arts and entertainment', 182),
 ('linguistics', 139),
 ('internet', 120),
 ('reference', 99),
 ('classification', 87),
 ('society', 86),
 ('finance', 78),
 ('online communities', 73),
 ('education', 72),
 ('leisure', 71),
 ('news', 68),
 ('biology', 63),
 ('languages', 63),
 ('health', 59),
 ('nlp', 59),
 ('software', 59),
 ('multiclass classification', 58),
 ('time series', 54),
 ('games and toys', 54),
 ('language resources', 52),
 ('image processing', 52),
 ('politics', 51),
 ('deep learning', 51),
 ('crime', 50),
 ('online image galleries', 49),
 ('social sciences', 48),
 ('online media', 47),
 ('sports', 45),
 ('text data', 45),
 ('music', 45),
 ('programming', 42),
 ('mathematics', 42),
 ('demographics', 41),
 ('law and government', 40),
 ('healthcare', 40),
 ('investing', 39),
 ('economics', 34),
 ('video games', 34),
 ('renewable energy', 34),
 

## create a set of the most frequent keywords

In [165]:
#this creates a new dictionary from the keyword_count dictionary which only includes
#the keywords that appear at least 10 times
frequent_keywords_dictionary = {k: v for k, v in keyword_count.items() if v >= 10}

In [166]:
#creating a set, which will be used in the following cells
frequent_keywords = set(frequent_keywords_dictionary.keys())

## Determining the most common health related keywords from the most common keywords
For the purposes of the project, this is where the AI needs a bit of human intervention.

I can look at the dictionary of frequent keywords and judge for myself which ones are health related. I did this and created a set of about 20 common health related keywords. I cross-checked it by going to Kaggle.com and using their filter to see health related datasets, along with the other common keywords that occur alongside the keyword "health".

Health related keywords, from searching Kaggle with the health filter and size > 10MB: health, healthcare, health conditions, nutrition, drugs and medications, diabetes, endocrine conditions, heart and hypertension, biotechnology, public health, mental health, cancer, womens health, diseases, epidemiology, oncology and cancer, biology.

In [167]:
health_keywords = set(['health', 'healthcare', 'health conditions', 'nutrition', 'drugs and medications',
                      'diabetes', 'endocrine conditions', 'health conditions', 'heart and hypertension', 
                      'biotechnology', 'public health', 'mental health', 'cancer', 'womens health', 
                      'diseases', 'epidemiology', 'oncology and cancer', 'biology', 'medicine',
                      'reproductive health', 'exercise','medical procedures', 'self care' ])

                    

In [168]:
print(frequent_keywords)
print(health_keywords)


{'cities', 'linguistics', 'animals', 'video games', 'india', 'social issues and advocacy', 'product reviews and price comparisons', 'europe', 'self care', 'medical procedures', 'social networks', 'demographics', 'twitter', 'stocks and bonds', 'social sciences', 'public health', 'text data', 'united states', 'board games', 'primary and secondary schooling (k-12)', 'shopping', 'writing', 'multimedia software', 'religion and belief systems', 'crime', 'classification', 'foreign language resources', 'time series', 'leisure', 'image data', 'financial markets news', 'mathematics', 'sport cycling', 'maps', 'computing', 'psychology', 'politics', 'news agencies', 'data management', 'feature engineering', 'games and toys', 'time series analysis', 'business services', 'cnn', 'healthcare', 'events and listings', 'online image galleries', 'exercise', 'economics', 'geography', 'team sports', 'health conditions', 'energy', 'lstm', 'reddit', 'research', 'money', 'earth sciences', 'investing', 'language

## what is the difference of these 2 sets?

Above we created a set from our dictionary of the most common keywords. Sets are useful for seeing the overlap (intersection) and difference between two groups of values. From our set of frequent keywords, we want to know which ones are not in the health keyword set. A dataset tagged only with those keywords can confidently be labeled non-health related (0).
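
As a quick illustration of the two set operations, here is a sketch with small hypothetical sets (the real sets are much larger):

```python
# Toy stand-ins for frequent_keywords and health_keywords.
frequent = {"health", "business", "biology", "crime"}
health = {"health", "biology", "medicine"}

# Difference: frequent keywords that are not health related.
non_health = frequent.difference(health)

# Intersection: frequent keywords that are also health related.
shared = frequent.intersection(health)

print(sorted(non_health), sorted(shared))
```

Note that `medicine` appears in neither result because it is not in the frequent set. The difference only ever removes elements from the set on the left.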

In [169]:
#most frequent keywords that have nothing to do with health, 
#so datasets with these tags we can say are 0
most_frequent_keywords_excludinghealth = frequent_keywords.difference(health_keywords)

In [170]:
most_frequent_keywords_excludinghealth 

{'acoustics',
 'agriculture',
 'animals',
 'artificial intelligence',
 'arts and entertainment',
 'association football',
 'astronomy',
 'automobiles',
 'binary classification',
 'board games',
 'brazil',
 'business',
 'business services',
 'categorical data',
 'chemistry',
 'cities',
 'classification',
 'climate',
 'cnn',
 'computer science',
 'computing',
 'crime',
 'currencies and foreign exchange',
 'cycling',
 'data management',
 'data visualization',
 'databases',
 'deep learning',
 'demographics',
 'dictionaries and encyclopedias',
 'dogs',
 'earth sciences',
 'economics',
 'education',
 'email and messaging',
 'energy',
 'europe',
 'events and listings',
 'feature engineering',
 'film',
 'finance',
 'financial markets news',
 'food and drink',
 'foreign language resources',
 'games and toys',
 'geographic reference',
 'geography',
 'government',
 'historiography',
 'history',
 'home',
 'image data',
 'image processing',
 'india',
 'internet',
 'investing',
 'journalism',
 'lang

## One hot encoding the keywords

We now have a few sets of keywords: frequent_keywords, health_keywords, and most_frequent_keywords_excludinghealth. How are we going to use these sets to label each row of our Kaggle metadata?

1. one hot encode each keyword, adding a binary column (0 or 1) for each keyword to our metadata dataset
2. cycle through each row and determine if there is a 1 in any keyword column that is health related
3. create a new column, "health_related", which will be 0 or 1 depending on whether any of the keywords for that row are in the health_keywords set
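
The three steps above can be sketched compactly with `Series.str.get_dummies`. This is an illustrative shortcut on a hypothetical toy frame with a reduced `health_keywords` set, not the loop-based approach used in the cells of this notebook.

```python
import pandas as pd

# Toy frame with cleaned, comma-separated keywords (one string per dataset).
toy = pd.DataFrame({
    "keywords": ["health,computing", "business,finance", "biology,image data"],
})

# Reduced stand-in for the health_keywords set.
health_keywords = {"health", "biology"}

# Step 1: one 0/1 column per keyword.
one_hot = toy["keywords"].str.get_dummies(sep=",")

# Step 2: pick out the keyword columns that are health related.
health_cols = [col for col in one_hot.columns if col in health_keywords]

# Step 3: a row is health related if any of those columns is 1.
toy["health_related"] = one_hot[health_cols].any(axis=1).astype(int)

print(toy)
```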

In [1]:
#read the filtered metadata fresh
#(run the import cell first, otherwise pd is not defined and this raises a NameError)
applied_filter_metadata = pd.read_csv('applied_filter_metadata.csv', encoding='utf-8')

In [60]:
#inspect the original columns
applied_filter_metadata.columns

Index(['Unnamed: 0', 'collaborators', 'data', 'datasetId', 'datasetSlug',
       'description', 'id', 'id_no', 'isPrivate', 'keywords', 'licenses',
       'ownerUser', 'subtitle', 'title', 'totalDownloads', 'totalViews',
       'totalVotes', 'usabilityRating', 'size'],
      dtype='object')

In [61]:
#inspect the original datatype of the keywords column, a string 
applied_filter_metadata.keywords[0]

"['health', 'computing', 'internet', 'cooking and recipes', 'food', 'online communities', 'social networks']"

In [62]:
#same as before, no null entries created by saving and loading csv
applied_filter_metadata['description'].isna().value_counts()

False    1193
True       11
Name: description, dtype: int64

In [171]:
#create a new dataframe where we will add all the new one hot encoded keyword columns
onehotencoded_keywords_df = applied_filter_metadata.copy()

In [172]:
#fill with 0s to begin, test
test_series = pd.Series(np.zeros(len(onehotencoded_keywords_df)))

In [173]:
test_series

0       0.0
1       0.0
2       0.0
3       0.0
4       0.0
       ... 
1199    0.0
1200    0.0
1201    0.0
1202    0.0
1203    0.0
Length: 1204, dtype: float64

In [174]:
#populating the data_to_display_1024 for the demo

for row in range(len(data_to_display_1024)):
    kw_list = keyword_cleaner(data_to_display_1024['keywords'][row])
    kw_token_list = kw_list.split(",")
    for kw_token in kw_token_list:
        #add a column of zeros the first time a keyword is seen
        if kw_token not in data_to_display_1024.columns:
            data_to_display_1024[kw_token] = 0.0
        #mark as present; .loc sets the value in place and avoids the
        #"value is trying to be set on a copy of a slice" warning
        data_to_display_1024.loc[row, kw_token] = 1.0


In [66]:
#applying one hot encoding to onehotencoded_keywords_df, which is used for the ML and NLP

#for every row in the dataframe
for row in range(len(onehotencoded_keywords_df)):
    #create a list of the keywords for that row
    kw_list = keyword_cleaner(onehotencoded_keywords_df['keywords'][row])
    kw_token_list = kw_list.split(",")
    #for every keyword in the list
    for kw_token in kw_token_list:
        #if that keyword hasn't yet been added to the columns, add it as a column of zeros
        if kw_token not in onehotencoded_keywords_df.columns:
            onehotencoded_keywords_df[kw_token] = 0.0
        #mark as present; .loc sets the value in place and avoids the
        #"value is trying to be set on a copy of a slice" warning
        onehotencoded_keywords_df.loc[row, kw_token] = 1.0


In [67]:
#checking the resulting dataframe columns
for col in onehotencoded_keywords_df.columns:
    print(col)

Unnamed: 0
collaborators
data
datasetId
datasetSlug
description
id
id_no
isPrivate
keywords
licenses
ownerUser
subtitle
title
totalDownloads
totalViews
totalVotes
usabilityRating
size
health
computing
internet
cooking and recipes
food
online communities
social networks
earth sciences
weather
north america
time series
government
politics
law
computer science
classification
cnn
image data
multiclass classification
pattern recognition
business operations
vehicle codes and driving laws
europe
leisure
hotels
hotels and accommodations
vacation rentals and short-term stays
humor
literature
psychology
reddit
funny pictures and videos
reference
photography
boating
classifieds
boats and watercraft
baseball
sports
sports news
business
demographics
road transport
business services
academics
natural and physical sciences
education
text data
crime
nlp
law and government
mobile web
web services
mobile apps and add-ons
biology
medicine
online image galleries
healthcare
public health
society
sound tech

In [68]:
#still 11 nans
onehotencoded_keywords_df['description'].isna().value_counts()

False    1193
True       11
Name: description, dtype: int64

## We now have a new dataframe, onehotencoded_keywords_df, which contains new columns (one for each keyword) populated with 1 or 0 depending on whether the keyword is present in the keywords for that row
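The same one-hot encoding can also be written without an explicit row loop. The following is only a minimal sketch, not the code used in this notebook: it assumes each `keywords` cell is the string form of a Python list, parses it with `ast.literal_eval` rather than the notebook's `keyword_cleaner`, and uses a hypothetical helper name, `one_hot_keywords`, on a toy dataframe.

```python
import ast

import pandas as pd


def one_hot_keywords(df, column="keywords"):
    """Return df plus one 0/1 float column per keyword (hypothetical helper)."""
    # Assumption: each cell is a string such as "['health', 'food']".
    parsed = df[column].apply(ast.literal_eval)
    # Join each list into one delimited string so str.get_dummies can split it.
    joined = parsed.apply("|".join)
    dummies = joined.str.get_dummies(sep="|").astype(float)
    return pd.concat([df, dummies], axis=1)


# Toy data for illustration only.
toy = pd.DataFrame({"keywords": ["['health', 'food']", "['food']", "[]"]})
encoded = one_hot_keywords(toy)
print(encoded)
```

One thing to watch with this approach: if a keyword has the same name as an existing column, `pd.concat` will produce a duplicated column name, so the real data may need a prefix on the keyword columns.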

## Next steps

For the purposes of the project, we are going to use the keyword information to create several different training datasets. These datasets will go on to be the training sets as we try different configurations of text samples from Kaggle/Zindi in order to see which combination of training data allows the NLP model to work best.

First we will use the keyword column information to create one new column, health_related, filled with 0 or 1 depending on whether the sample is health related or not.

We have also previously applied some filters above based on usability and size. To gather more text information for our model, we will also reload all the original metadata (without the filter applied) to get more training samples of health related and non health related descriptions.

So, to start, let's take a working copy of the original metadata.

In [69]:
#create a working copy
metadata_copy = metadata.copy()

In [70]:
#check the keywords column
metadata_copy['keywords'][0]

"['business', 'sensitive subjects', 'agriculture and forestry', 'fire and security services']"

In [71]:
#we can onehot encode all the keywords again as above to the original metadata

for row in range(len(metadata_copy)):
    kw_list = keyword_cleaner(metadata_copy['keywords'].iloc[row])
    kw_token_list = kw_list.split(",")
    for kw_token in kw_token_list:
        #if that keyword hasn't yet been added to the columns, add it
        if kw_token not in metadata_copy.columns:
            metadata_copy[kw_token] = pd.Series(np.zeros(len(metadata_copy)))
        #mark as present; positional .iloc avoids the chained-assignment warning
        metadata_copy.iloc[row, metadata_copy.columns.get_loc(kw_token)] = 1.0


In [72]:
#check the columns have been created
for col in metadata_copy.columns:
    print(col)

collaborators
data
datasetId
datasetSlug
description
id
id_no
isPrivate
keywords
licenses
ownerUser
subtitle
title
totalDownloads
totalViews
totalVotes
usabilityRating
business
sensitive subjects
agriculture and forestry
fire and security services
games and toys
sports
martial arts
sports news
video games
computing
internet
mobile web
strategy games
business services
mobile apps and add-ons
africa
history
finance
banking
economics
reference
business news
statistics
natural and physical sciences
transport
automobiles
freight and trucking
india
faith and traditions
christianity
linguistics
foreign language resources
lgbt
society
survey analysis
health
cooking and recipes
food
online communities
social networks
europe
social sciences
psychology
social issues and advocacy
consumer electronics
consumer resources
product reviews and price comparisons
mobile phones
alcohol
tabular data
restaurants
storytelling
atmospheric sciences
eda
data visualization
climate change and global warming
green


## Now we want to collapse the health related keywords into one column: health related or not. We have our set of health related keywords


In [180]:
health_keywords = set(['health', 'healthcare', 'health conditions', 'nutrition', 'drugs and medications',
                      'diabetes', 'endocrine conditions', 'health conditions', 'heart and hypertension', 
                      'biotechnology', 'public health', 'mental health', 'cancer', 'womens health', 
                      'diseases', 'oncology and cancer', 'biology', 'medicine',
                      'reproductive health', 'exercise','medical procedures', 'self care' ])
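The collapse into a single `health_related` column can also be done in one vectorized step. This is a hedged sketch on a toy dataframe, not the notebook's own method: the toy columns and the shortened keyword set are made up for illustration, and it assumes the keyword columns hold 1.0 where a keyword is present.

```python
import pandas as pd

# Toy stand-in for a one-hot encoded frame; the real one has one column per keyword.
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "health": [1.0, 0.0, 0.0],
    "medicine": [0.0, 0.0, 1.0],
    "sports": [0.0, 1.0, 0.0],
})
toy_health_keywords = {"health", "medicine", "cancer"}  # shortened set, illustration only

# Only use keywords that actually exist as columns, to avoid a KeyError.
present = [kw for kw in sorted(toy_health_keywords) if kw in toy.columns]
# A row is health related if any of its health keyword columns equals 1.
toy["health_related"] = toy[present].eq(1.0).any(axis=1).astype(float)
print(toy["health_related"].tolist())
```

Filtering the keyword set down to existing columns matters because a keyword that never occurs in a given dataframe has no column to look up.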


In [175]:
#continuing in parallel is our data_to_display dataset, which is used for the demo
collapse_data_to_display = data_to_display_1024.copy()

In [74]:
#creating a health_related collapsed df for the filtered kaggle data
collapse_health_df = onehotencoded_keywords_df.copy()

In [75]:
#creating a health_related collapsed df for WHOLE metadataset
collapse_health_metadata = metadata_copy.copy()

In [178]:
#add a column populated with 0s for health related
collapse_data_to_display['health_related'] = pd.Series(np.zeros(len(collapse_data_to_display)))

In [181]:
#filling up the health_related column for the demo dataset
health_col = collapse_data_to_display.columns.get_loc('health_related')
for row in range(len(collapse_data_to_display)):
    for keyword in health_keywords:
        if collapse_data_to_display[keyword].iloc[row] == 1.0:
            #positional .iloc avoids the chained-assignment warning
            collapse_data_to_display.iloc[row, health_col] = 1.0
        #otherwise keep the 0 the health_related column was initialized with


In [76]:
#add a column to collapse the health keywords, 'health_related'
collapse_health_df['health_related'] = pd.Series(np.zeros(len(collapse_health_df)))

In [77]:
#adding a column to a copy of the WHOLE metadataset
collapse_health_metadata['health_related'] = pd.Series(np.zeros(len(collapse_health_metadata)))

In [78]:
#filling up the health_related column for the filtered kaggle datasets
health_col = collapse_health_df.columns.get_loc('health_related')
for row in range(len(collapse_health_df)):
    for keyword in health_keywords:
        if collapse_health_df[keyword].iloc[row] == 1.0:
            #positional .iloc avoids the chained-assignment warning
            collapse_health_df.iloc[row, health_col] = 1.0
        #otherwise keep the 0 the health_related column was initialized with


In [79]:
#filling up the health_related column for the WHOLE METADATASET
health_col = collapse_health_metadata.columns.get_loc('health_related')
for row in range(len(collapse_health_metadata)):
    for keyword in health_keywords:
        if collapse_health_metadata[keyword].iloc[row] == 1.0:
            #positional .iloc avoids the chained-assignment warning
            collapse_health_metadata.iloc[row, health_col] = 1.0
        #otherwise keep the value the health_related column was initialized with


In [80]:
#there are 654 health related datasets in the full (unfiltered) kaggle metadata
collapse_health_metadata['health_related'].value_counts()

1.0    654
Name: health_related, dtype: int64

In [183]:
#indices of health related datasets for the data_to_display for demo
collapse_data_to_display_indices = np.where(collapse_data_to_display['health_related'] == 1.0)

In [185]:
#creating the final data_to_display for the demo, which has been filtered for size and usability
#and containing only the indices where health_related is 1
data_to_display_final = collapse_data_to_display.iloc[list(collapse_data_to_display_indices[0]), :]
collapse_data_to_display['health_related'].value_counts()

0.0    1046
1.0     158
Name: health_related, dtype: int64

In [188]:
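#save the demo data; note this writes collapse_data_to_display (all 1204 rows, with the health_related flag), not the health-only subset data_to_display_final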
collapse_data_to_display.to_csv('data_to_display_final.csv')

In [81]:
#take the indices of the metadataset where health_related is 1
FULL_metadata_healthrelated_indices = np.where(collapse_health_metadata['health_related'] == 1.0)

In [82]:
#new DF containing all the rows where health_related was 1, on the FULL METADATASET
FULL_kagglemetadata_health_related = collapse_health_metadata.iloc[list(FULL_metadata_healthrelated_indices[0]), :]

In [83]:
#there are 654 total health related datasets in the full metadata from kaggle
FULL_kagglemetadata_health_related['health_related'].value_counts()

1.0    654
Name: health_related, dtype: int64

In [84]:
#save to csv in the same title format as below
FULL_kagglemetadata_health_related.to_csv('FULL_kagglemetadata_health_related.csv', encoding='utf-8')

In [85]:
collapse_health_df['health_related'].value_counts()

0.0    1046
1.0     158
Name: health_related, dtype: int64

In [86]:
#still only 11 nans
collapse_health_df['description'].isna().value_counts()

False    1193
True       11
Name: description, dtype: int64

## now we just need the index location of all the rows that are not health related!!


In [87]:
not_health_related_indices = np.where(collapse_health_df['health_related'] == 0.0)

In [88]:
#new dataset containing all the rows where health_related was 0, with all the original columns
kaggle_metadata_not_health_related_df = collapse_health_df.iloc[list(not_health_related_indices[0]), :]

In [91]:
kaggle_metadata_not_health_related_df['health_related'].value_counts()

0.0    1046
Name: health_related, dtype: int64

## We will also need a df with the kaggle examples that are health related, for testing!! The dataframe below contains only the health related examples that were left after applying the filter usability > .6 & size > 10mb

In [92]:
health_related_indices = np.where(collapse_health_df['health_related'] == 1.0)
#new dataset containing all the rows where health_related was 1, with all the original columns
kaggle_metadata_health_related_df = collapse_health_df.iloc[list(health_related_indices[0]), :]
kaggle_metadata_health_related_df['health_related'].value_counts()

1.0    158
Name: health_related, dtype: int64
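The `np.where` plus `.iloc` pattern used for these splits can be replaced by a boolean mask. This is a small sketch on a toy dataframe (the data is made up), offered as an alternative rather than as what the notebook ran.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "health_related": [1.0, 0.0, 0.0, 1.0],
})

# The comparison yields a True/False Series that selects rows directly.
is_health = toy["health_related"] == 1.0
health_df = toy[is_health]
not_health_df = toy[~is_health]

print(health_df["title"].tolist())
print(not_health_df["title"].tolist())
```

Note that `~is_health` also keeps rows where `health_related` is missing, whereas comparing with `== 0.0` would drop them, so the two are only interchangeable when the column has no NaN values.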

In [93]:
#still 11 null values
kaggle_metadata_not_health_related_df['description'].isna().value_counts()

False    1035
True       11
Name: description, dtype: int64

In [94]:
kaggle_metadata_not_health_related_df.to_csv('kaggle_not_health_related.csv', encoding='utf-8')
kaggle_metadata_health_related_df.to_csv('kaggle_health_related.csv', encoding='utf-8')

## Assessment: What has been done so far
The previous cells have yielded dataframes (saved to csvs) with different filters applied.

For the NLP model, we experimented with feeding it different combinations of data that come from the dataframes above.

Essentially, to determine whether a dataset is health related or not based on its title+description, we need a training set for the NLP model composed of the Zindi training data, which is all health related (and was prepared in a previous notebook), and some descriptions which are not health related (Kaggle).

To summarize the findings of the project and clarify what we have done in this notebook: we first tried a training dataset composed of Zindi health related samples (1s) and Kaggle non health related samples (0s) filtered by usability and size.

Next we returned to this notebook in order to build more training data. We created a dataframe/csv of all the Kaggle examples that WERE health related (1s) and added it to our Zindi health related samples (1s) in order to address some bias in the training data from the first attempt (where all the health related 1s came from various media sources, and all the Kaggle 0s came from dataset descriptions).

Finally we applied the same process to all the metadata (not filtered by usability and size) just to gather more text information to train the NLP model on.
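As a rough illustration of how such a training set can be assembled, here is a sketch using tiny made-up dataframes. The frame names, the `text` and `label` column names, and the sample rows are all hypothetical stand-ins; the real Zindi and Kaggle data live in the csvs described above.

```python
import pandas as pd

# Hypothetical stand-ins for the real data sources.
zindi_health = pd.DataFrame({"text": ["malaria prevention campaign", "maternal health clinic"]})
kaggle_health = pd.DataFrame({"text": ["heart disease patient records"]})
kaggle_nonhealth = pd.DataFrame({"text": ["airbnb listings", "baseball pitch data"]})

# Health related samples from both sources are labelled 1, the rest 0.
positives = pd.concat([zindi_health, kaggle_health], ignore_index=True).assign(label=1)
negatives = kaggle_nonhealth.assign(label=0)
training = pd.concat([positives, negatives], ignore_index=True)

# Shuffle so the classes are interleaved; random_state makes it repeatable.
training = training.sample(frac=1, random_state=0).reset_index(drop=True)
print(training["label"].value_counts())
```

Mixing Kaggle health related descriptions into the positives is what addresses the source bias mentioned above: otherwise the model could learn to separate "media text" from "dataset description" instead of health from non-health.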


## Final steps, cleaning and preparing the title+description text for NLP

We keep the title and the CONTEXT section of the description only (for most samples this contains the background information about the dataset), and remove the word "context".

In [95]:
#reloading the csvs fresh
clean_kaggle_nonhealth_related = pd.read_csv("kaggle_not_health_related.csv", encoding='utf-8')
clean_kaggle_health_related = pd.read_csv('kaggle_health_related.csv', encoding='utf-8')

clean_FULL_kagglemetadata_health_related = pd.read_csv('FULL_kagglemetadata_health_related.csv', encoding='utf-8')

In [96]:
#still 11 nans
clean_kaggle_nonhealth_related['description'].isna().value_counts()

False    1035
True       11
Name: description, dtype: int64

In [97]:
#inspecting the description column you can see most descriptions begin with a context heading
clean_kaggle_nonhealth_related['description']

0       ### Context\nCanadians elect representatives t...
1       ### Related Paper\nSichkar V. N., Kolyubin S. ...
2       ### Context\n\nAirbnb has successfully disrupt...
3       ***Context***\n\nThis dataset contains 1.3 mil...
4       ### Context\n\nThis dataset is used on this bl...
                              ...                        
1041    # Context \r\n\r\nThere's a story behind every...
1042    The United State Federal Highway Administratio...
1043    ### Context\n\nThis works focuses upon creatin...
1044    ### Context\n\nDataset com fotos tiradas com a...
1045    ### Context\nThis is an attempt to figure out ...
Name: description, Length: 1046, dtype: object

In [98]:
#testing splitting the string descriptions based on context
test1 = clean_kaggle_nonhealth_related['description'][0]
test2 = clean_kaggle_nonhealth_related['description'][5]
test3 = clean_kaggle_nonhealth_related['description'][3]

In [99]:
test1.split('### Content')[0]

'### Context\nCanadians elect representatives to the House of Commons. The leader of the party who has the confidence of a majority of members of the House forms the Government. This data explores the results of elections over the past couple of decades. \n\n\n\n'

In [100]:
test2.split('### Content')[0]

"Pitch-level data for every pitch thrown during the 2015-2018 MLB regular seasons. Data scraped from http://gd2.mlb.com/components/game/mlb/. Each row represents a single pitch.\n\nThe data doesn't come with clear definitions (that I can find, at least). Here's what I believe the codes mean:\n\n# Pitch Type Definitions #\n\nCH - Changeup\n\nCU - Curveball\n\nEP - Eephus*\n\nFC - Cutter\n\nFF - Four-seam Fastball\n\nFO - Pitchout (also PO)*\n\nFS - Splitter\n\nFT - Two-seam Fastball\n\nIN - Intentional ball\n\nKC - Knuckle curve\n\nKN - Knuckeball\n\nPO - Pitchout (also FO)*\n\nSC - Screwball*\n\nSI - Sinker\n\nSL - Slider\n\nUN - Unknown*\n\n* these pitch types occur rarely\n\n# Code Definitions #  \n\nWhile these aren't spelled out anywhere, play descriptions allowed confident identification of these codes\n\nB - Ball\n\n\\*B - Ball in dirt\n\nS - Swinging Strike\n\nC - Called Strike\n\nF - Foul\n\nT - Foul Tip\n\nL - Foul Bunt\n\nI - Intentional Ball\n\nW - Swinging Strike (Blocked)\

In [101]:
#preparing a description cleaner 

import re

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
#raw strings keep the regex backslashes from being read as string escapes
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]#*')
BAD_SYMBOLS_RE = re.compile(r'[^0123456789a-z +_]')
def description_cleaner(description):
    #str() also turns a missing description (NaN) into the literal string 'nan'
    text = str(description)
    text = text.lower()# lowercase text
    #keep only the text before the first occurrence of "content", i.e. the "### context" section
    #(note this also cuts at the word "content" inside a sentence)
    text = text.split("content")[0]
    text = text.replace('\n', " ")
    text = text.replace("context", '')
    text = re.sub(REPLACE_BY_SPACE_RE, ' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = re.sub(BAD_SYMBOLS_RE, '', text)# delete symbols which are in BAD_SYMBOLS_RE from text
    text = [word for word in text.split() if word not in STOPWORDS]# delete stopwords from text
    text = " ".join(text)
    return text
    

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/annachesson/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [102]:
description_cleaner(test3)

'dataset contains 13 million sarcastic comments internet commentary website reddit dataset generated scraping comments reddit containing sarcasm tag tag often used redditors indicate comment jest meant taken seriously generally reliable indicator sarcastic comment'
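Because `description_cleaner` splits on any occurrence of "content", a description that uses that word in its first sentence is cut short. A stricter variant would split only on a line that is a "Content" heading. The sketch below is an assumption about what such a variant could look like, not part of the original pipeline; `extract_context` and both regexes are hypothetical names.

```python
import re

# A heading line is one containing only the word plus markdown markers (#, *) and spaces.
CONTENT_HEADING_RE = re.compile(r"^[#* \t]*content[#* \t\r]*$", re.IGNORECASE | re.MULTILINE)
CONTEXT_HEADING_RE = re.compile(r"^[#* \t]*context[#* \t\r]*$", re.IGNORECASE | re.MULTILINE)


def extract_context(description):
    """Return the text before the first 'Content' heading, minus any 'Context' heading."""
    text = str(description)
    before_content = CONTENT_HEADING_RE.split(text, maxsplit=1)[0]
    return CONTEXT_HEADING_RE.sub("", before_content).strip()


# Made-up sample in the same shape as the Kaggle descriptions.
sample = (
    "### Context\nThis data explores the content of election results.\n\n"
    "### Content\nOne row per riding."
)
context_only = extract_context(sample)
print(context_only)
```

The result could then be passed through the same lowercasing, symbol removal and stopword filtering as before.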

In [103]:
#clean the descriptions and the titles
clean_kaggle_nonhealth_related['description'] = [description_cleaner(row) for row in clean_kaggle_nonhealth_related['description']]
clean_kaggle_nonhealth_related['title'] = [description_cleaner(row) for row in clean_kaggle_nonhealth_related['title']]

In [104]:
#check the result
clean_kaggle_nonhealth_related['description'][0]

'canadians elect representatives house commons leader party confidence majority members house forms government data explores results elections past couple decades'

In [105]:
#clean the descriptions and titles for the health_related df as well
clean_kaggle_health_related['description'] = [description_cleaner(row) for row in clean_kaggle_health_related['description']]
clean_kaggle_health_related['title'] = [description_cleaner(row) for row in clean_kaggle_health_related['title']]

In [106]:
len(clean_kaggle_health_related)

158

In [107]:
#clean the descriptions and titles for the health_related FULL METADATA as well
clean_FULL_kagglemetadata_health_related['description'] = [description_cleaner(row) for row in clean_FULL_kagglemetadata_health_related['description']]
clean_FULL_kagglemetadata_health_related['title'] = [description_cleaner(row) for row in clean_FULL_kagglemetadata_health_related['title']]

In [108]:
clean_FULL_kagglemetadata_health_related['description'].isna().value_counts()

False    654
Name: description, dtype: int64

In [109]:
clean_kaggle_nonhealth_related['description'].isna().value_counts()

False    1046
Name: description, dtype: int64

In [110]:
clean_kaggle_nonhealth_related['title'].isna().value_counts()

False    1046
Name: title, dtype: int64

In [111]:
clean_kaggle_nonhealth_related['description'].head(10)

0    canadians elect representatives house commons ...
1    related paper sichkar v n kolyubin effect vari...
2    airbnb successfully disrupted traditional hosp...
3    dataset contains 13 million sarcastic comments...
4    dataset used blog post https clorichelcom blog...
5    pitchlevel data every pitch thrown 20152018 ml...
6    safegraph safegraph democratizing access data ...
7    cvpr http cvpr2019thecvfcom premier annual com...
8    historical dataset containing 13 087 press rel...
9    historical dataset containing 13 087 press rel...
Name: description, dtype: object

In [112]:
clean_FULL_kagglemetadata_health_related.to_csv('clean_description_FULLkagglemetadata_health_related.csv')
clean_kaggle_nonhealth_related.to_csv('clean_descriptions_kaggle_nonhealth_related.csv')
clean_kaggle_health_related.to_csv('clean_descriptions_kaggle_health_related.csv')

## Conclusions

Obviously, this notebook is difficult to interpret without knowing the full scope of my project. In essence this notebook can be understood as the most important "working" notebook of my project, where I returned to create new datasets which required the same filtering and preprocessing with nltk and pandas.



## One strange observation
The number of null values in ['description'] changes from 11 to 39 JUST BY SAVING AND LOADING?

In [158]:
df = pd.read_csv('clean_descriptions_kaggle_nonhealth_related.csv')

In [159]:
df['description'].isna().value_counts()

False    1007
True       39
Name: description, dtype: int64
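A likely explanation, offered as an assumption rather than something checked against the actual files: `description_cleaner` calls `str()` on every description, so the 11 missing values become the literal string 'nan' (which is why `isna()` reported no nulls right after cleaning), and some descriptions may end up as empty strings once everything after "content" and all stopwords are removed. By default `pd.read_csv` parses both 'nan' and empty fields as missing values, so they all come back as NaN. The sketch below reproduces the effect on made-up data.

```python
import io

import pandas as pd

# Made-up cleaned descriptions: normal text, the string 'nan', and an empty string.
cleaned = pd.DataFrame({"description": ["some cleaned text", "nan", ""]})
nulls_before = int(cleaned["description"].isna().sum())

# Round-trip through CSV in memory instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer)

buffer.seek(0)
default_load = pd.read_csv(buffer, index_col=0)
nulls_default = int(default_load["description"].isna().sum())

# keep_default_na=False stops pandas from treating 'nan' and '' as missing.
buffer.seek(0)
strict_load = pd.read_csv(buffer, index_col=0, keep_default_na=False)
nulls_strict = int(strict_load["description"].isna().sum())

print(nulls_before, nulls_default, nulls_strict)
```

If this is what happened, counting rows where the cleaned description equals 'nan' or '' before saving would account for the 39.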