# Cleaning the show list

Before I begin requesting information from the APIs, I need to clean the list of links I collected so that each entry is a plain show name the APIs can search for.

The cleaning removes:
- the leading URL portion (which I had added in my first function)
- Wikipedia formatting and category links that were caught in my search
- special characters and broken Unicode

In [1]:
import pickle
import json
import re

## Loading in the list and cleaning the show names

In [2]:
with open('./Assets_&_Data/amc_links.pickle', 'rb') as fp:
    amc_links = pickle.load(fp)

with open('./Assets_&_Data/as_links.pickle', 'rb') as fp:
    as_links = pickle.load(fp)

with open('./Assets_&_Data/cbs_links.pickle', 'rb') as fp:
    cbs_links = pickle.load(fp)

with open('./Assets_&_Data/cc_links.pickle', 'rb') as fp:
    cc_links = pickle.load(fp)

with open('./Assets_&_Data/cw_links.pickle', 'rb') as fp:
    cw_links = pickle.load(fp)

with open('./Assets_&_Data/disney_links.pickle', 'rb') as fp:
    disney_links = pickle.load(fp)

with open('./Assets_&_Data/fox_links.pickle', 'rb') as fp:
    fox_links = pickle.load(fp)

with open('./Assets_&_Data/hbo_links.pickle', 'rb') as fp:
    hbo_links = pickle.load(fp)

with open('./Assets_&_Data/nbc_links.pickle', 'rb') as fp:
    nbc_links = pickle.load(fp)

with open('./Assets_&_Data/syfy_links.pickle', 'rb') as fp:
    syfy_links = pickle.load(fp)

with open('./Assets_&_Data/abc_links.pickle', 'rb') as fp:
    abc_links = pickle.load(fp)
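
The eleven blocks above differ only in the network prefix, so the loading could also be written as a loop. This is a minimal sketch, assuming the same `<network>_links.pickle` naming; `load_link_lists` and `NETWORKS` are hypothetical names introduced for illustration and are not used elsewhere in this notebook.

```python
import pickle
from pathlib import Path

# Assumption: network prefixes taken from the pickle file names loaded above
NETWORKS = ('amc', 'as', 'cbs', 'cc', 'cw', 'disney',
            'fox', 'hbo', 'nbc', 'syfy', 'abc')

def load_link_lists(data_dir, networks=NETWORKS):
    """Return a dict mapping each network prefix to its unpickled list of links."""
    link_lists = {}
    for name in networks:
        path = Path(data_dir) / f'{name}_links.pickle'
        with open(path, 'rb') as fp:
            link_lists[name] = pickle.load(fp)
    return link_lists
```

With this helper, `link_lists = load_link_lists('./Assets_&_Data')` would give `link_lists['hbo']` in place of `hbo_links`.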

## Cleaning links to be strings of show names

In [3]:
abc_links[:5]

['https://en.wikipedia.org#Drama_series',
 'https://en.wikipedia.org#Science-fiction_series',
 'https://en.wikipedia.org#Westerns',
 'https://en.wikipedia.org#Game_shows_2',
 'https://en.wikipedia.org#Miniseries']

In [4]:
pattern = re.compile(r'https://en\.wikipedia\.org/wiki/(.*)')  # Raw string; dots escaped so they match literally

In [5]:
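def remove_wiki(link_list):
    # Keep only article links and strip the URL prefix.
    # Links without '/wiki/' (e.g. the '#Drama_series' anchors above) are dropped.
    matches = (pattern.search(x) for x in link_list)
    cleaned_links = [m.group(1) for m in matches if m]
    return cleaned_links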


In [6]:
show_list = remove_wiki(hbo_links) + remove_wiki(nbc_links) +   \
            remove_wiki(amc_links) + remove_wiki(as_links) +    \
            remove_wiki(cbs_links) + remove_wiki(cc_links) +    \
            remove_wiki(cw_links) + remove_wiki(disney_links) + \
            remove_wiki(syfy_links) + remove_wiki(fox_links) + remove_wiki(abc_links)

In [7]:
len(show_list)

9822

In [8]:
# Drop Wikipedia housekeeping pages (templates, talk pages, categories, etc.).
# Note: names still contain underscores at this stage, so 'List of ' does not
# match 'List_of_...' pages; using 'List_of_' instead would change the count below.
non_show_prefixes = ('Template', 'List of ', 'Special:', 'Wikipedia:', 'Portal',
                     'IMDB', 'Talk:', 'Help:', 'Category')
show_list = [x for x in show_list if not x.startswith(non_show_prefixes)]

In [9]:
len(show_list)

9528

In [10]:
show_list[:10]

['Game_of_Thrones',
 'Westworld_(TV_series)',
 'Big_Little_Lies_(TV_series)',
 'The_Deuce_(TV_series)',
 'Succession_(TV_series)',
 'Curb_Your_Enthusiasm',
 'Veep',
 'Silicon_Valley_(TV_series)',
 'Ballers',
 'High_Maintenance']

In [11]:
def clean_string(string):
    string = re.sub(r'_', ' ', string)            # Replaces underscores with spaces
    string = re.sub(r'%27', "'", string)          # Decodes percent-encoded apostrophes
    string = re.sub(r' \(([^)]+)\)', '', string)  # Removes parentheses and the text within them
    string = re.sub(r'%26', '&', string)          # Decodes percent-encoded ampersands
    string = re.sub(r'#', '', string)             # Removes hashes
    string = re.sub(r'%3F', '?', string)          # Decodes percent-encoded question marks
    string = re.sub(r'/', '', string)             # Removes forward slashes
    return string
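
The three `%xx` substitutions above cover only the escapes listed explicitly. The standard library's `urllib.parse.unquote` decodes every percent-encoded sequence, including multi-byte UTF-8 ones such as `%C3%A9`, in a single call. This is a minimal sketch of that alternative, not the function used in the rest of this notebook. `clean_string_unquote` is a hypothetical name, and unlike `clean_string` it leaves `#` and `/` characters in place.

```python
import re
from urllib.parse import unquote

def clean_string_unquote(string):
    string = unquote(string)                        # Decodes all %xx escapes, e.g. %27 -> ' and %26 -> &
    string = string.replace('_', ' ')               # Replaces underscores with spaces
    string = re.sub(r' \(([^)]+)\)', '', string)    # Removes parentheses and the text within them
    return string
```

For example, `clean_string_unquote('Law_%26_Order')` decodes the ampersand without needing a dedicated substitution for `%26`.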

In [12]:
def clean_links(link_list):
    clean_list = [clean_string(x) for x in link_list]
    return clean_list

In [13]:
clean_show_list = clean_links(show_list)

In [14]:
clean_show_list[:10]

['Game of Thrones',
 'Westworld',
 'Big Little Lies',
 'The Deuce',
 'Succession',
 'Curb Your Enthusiasm',
 'Veep',
 'Silicon Valley',
 'Ballers',
 'High Maintenance']

### Much better. Now that I have this cleaned list, I can load it into the other notebooks instead of the 11 separate link files.

In [16]:
with open('./Assets_&_Data/clean_show_list.pickle', 'wb') as fp:
    pickle.dump(clean_show_list, fp)