# Bespoke Example 1

**Problem**:
We have a tab-separated file whose second column contains text.
We also have a list of words, called the "grey list", that we want to look for in that text.

The program has to add a new column to the file which will contain:
- `True` if any of the grey list words is in the text of that row
- `False` if none of the grey list words is present in the text of that row

### exploring

In [1]:
grey_list = ["flag", "it", "either", "lowers"]

In [2]:
filepath = "bespoke_samples/dummyfile.txt"

with open(filepath) as f:
    content = f.read()
    
print(content)

henchmen	<s>The Third Reich finally lowers its flag, Hitler and his henchmen are rendered harmless.</s>	View	2020	2349222	IRE	NONE
either	<s>The Spanish electoral system does not favor them either.</s>	The Outlook	2021	3218937	UK	NONE
either	<s>Without knowing how many people we are talking about either.</s>	The Outlook	2021	3999237	UK	NONE
either	<s>And electric trucks equipped for equipment either.</s>	Euromag	2020	7182040	UK	NONE
either	<s>Many of our new customers since confinement either.</s>	The Tribune	2020	7790669	UK	NONE
or not	<s>Is it a big gap or not that much?</s>	Cinema Times	2020	8162203	UK	NONE
or not	<s>Each character has their own vision of what they think should or should not be done.</s>	Cinema Times	2021	8384123	UK	NONE
or not	<s>Many are for having the freedom to wear it or not.</s>	Gazette	2021	8923223	USA	NONE
or not	<s>We could find him a local accent or not.</s>	Euromag	2021	9401415	UK	NONE



### handmade parser

In [3]:
rows = content.split("\n")
values = []
for row in rows:
    row_values = row.split("\t")
    if len(row_values) < 2:
        continue
    values.append(row_values[1])
values

['<s>The Third Reich finally lowers its flag, Hitler and his henchmen are rendered harmless.</s>',
 '<s>The Spanish electoral system does not favor them either.</s>',
 '<s>Without knowing how many people we are talking about either.</s>',
 '<s>And electric trucks equipped for equipment either.</s>',
 '<s>Many of our new customers since confinement either.</s>',
 '<s>Is it a big gap or not that much?</s>',
 '<s>Each character has their own vision of what they think should or should not be done.</s>',
 '<s>Many are for having the freedom to wear it or not.</s>',
 '<s>We could find him a local accent or not.</s>']
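
The same extraction can be sketched with the standard-library `csv` module, which does the splitting for us. This is only an illustration under assumptions: `sample_tsv` is a made-up two-row stand-in for the real file, and `second_column` is a hypothetical helper name. `quoting=csv.QUOTE_NONE` tells the reader to treat quote characters as ordinary text.

```python
import csv
import io

# Made-up sample standing in for the real file (note the blank line in the middle).
sample_tsv = (
    "either\t<s>First sentence.</s>\tView\t2020\n"
    "\n"
    "or not\t<s>Second sentence.</s>\tGazette\t2021\n"
)

def second_column(handle):
    """Collect the second field of every row that has at least two fields."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # blank lines come back as empty lists, so the length check skips them
    return [row[1] for row in reader if len(row) >= 2]

texts = second_column(io.StringIO(sample_tsv))
print(texts)
```

With a real file, the handle would come from `open(filepath, newline="")` instead of `io.StringIO`.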

### parsing using pandas

In [4]:
# !python -m pip install pandas

In [5]:
import pandas as pd

df = pd.read_csv(filepath, sep="\t", header=None)
df

Unnamed: 0,0,1,2,3,4,5,6
0,henchmen,"<s>The Third Reich finally lowers its flag, Hi...",View,2020,2349222,IRE,NONE
1,either,<s>The Spanish electoral system does not favor...,The Outlook,2021,3218937,UK,NONE
2,either,<s>Without knowing how many people we are talk...,The Outlook,2021,3999237,UK,NONE
3,either,<s>And electric trucks equipped for equipment ...,Euromag,2020,7182040,UK,NONE
4,either,<s>Many of our new customers since confinement...,The Tribune,2020,7790669,UK,NONE
5,or not,<s>Is it a big gap or not that much?</s>,Cinema Times,2020,8162203,UK,NONE
6,or not,<s>Each character has their own vision of what...,Cinema Times,2021,8384123,UK,NONE
7,or not,<s>Many are for having the freedom to wear it ...,Gazette,2021,8923223,USA,NONE
8,or not,<s>We could find him a local accent or not.</s>,Euromag,2021,9401415,UK,NONE


In [6]:
list_of_text = df[1].to_list()
list_of_text

['<s>The Third Reich finally lowers its flag, Hitler and his henchmen are rendered harmless.</s>',
 '<s>The Spanish electoral system does not favor them either.</s>',
 '<s>Without knowing how many people we are talking about either.</s>',
 '<s>And electric trucks equipped for equipment either.</s>',
 '<s>Many of our new customers since confinement either.</s>',
 '<s>Is it a big gap or not that much?</s>',
 '<s>Each character has their own vision of what they think should or should not be done.</s>',
 '<s>Many are for having the freedom to wear it or not.</s>',
 '<s>We could find him a local accent or not.</s>']

### processing

In [7]:
sample = df.iloc[1,1]  # grabbing the value in the second row, second column
print(sample)

<s>The Spanish electoral system does not favor them either.</s>


#### manual cleaning of string

In [8]:
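def manual_cleaning(sample):
    """Lowercase the text and keep only the letters a-z and spaces."""
    clean_sample = ''
    useful_char = "abcdefghijklmnopqrstuvwxyz "

    # sample[3:-4] skips the leading "<s>" and the trailing "</s>"
    for letter in sample[3:-4]:
        if letter.lower() in useful_char:
            clean_sample = clean_sample + letter.lower()
    return clean_sample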

print(sample[3:-4])
print(manual_cleaning(sample))

The Spanish electoral system does not favor them either.
the spanish electoral system does not favor them either
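
The slice `[3:-4]` only works if every row really starts with `<s>` and ends with `</s>`. A slightly more defensive variant, sketched below, removes the tags only when they are present. `strip_sentence_tags` is a hypothetical helper name, not part of the original notebook.

```python
import re

def strip_sentence_tags(text):
    """Remove a leading <s> and a trailing </s> if present; leave other text alone."""
    # "^<s>" matches the tag at the start, "</s>$" matches the tag at the end
    return re.sub(r"^<s>|</s>$", "", text.strip())

print(strip_sentence_tags("<s>The Spanish electoral system does not favor them either.</s>"))
print(strip_sentence_tags("A row without tags"))
```

Unlike the slice, this leaves a row without tags untouched instead of cutting off its first three and last four characters.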


#### cleaning of string using regex module

In [9]:
import re

def clean_and_tokenize_row_text(row_text):
    # raw string avoids the invalid-escape warning; every non-word character becomes a space
    clean_text = re.sub(r"\W", " ", row_text[3:-4])
    return [i.lower() for i in clean_text.split()]

In [10]:
print(clean_and_tokenize_row_text(sample))

['the', 'spanish', 'electoral', 'system', 'does', 'not', 'favor', 'them', 'either']
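
Why tokenize at all instead of checking `word in text` directly? A short sketch of the difference, using the first row of the sample file and the grey word "it": a substring test also matches inside longer words, while a token test only matches whole words.

```python
import re

text = "<s>The Third Reich finally lowers its flag, Hitler and his henchmen are rendered harmless.</s>"
word = "it"

# substring test: "it" occurs inside "its" and "hitler"
substring_hit = word in text.lower()

# token test: same cleaning as above, then compare whole words only
tokens = [t.lower() for t in re.sub(r"\W", " ", text[3:-4]).split()]
token_hit = word in tokens

print(substring_hit)  # True
print(token_hit)      # False
```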


#### Tip: list comprehension

In [11]:
tokens = clean_and_tokenize_row_text(sample)

In [12]:
# explicit loop
tokens_lower = []
for i in tokens:
    tokens_lower.append(i.lower())

# equivalent list comprehension: same result in one line
tokens_lower = [i.lower() for i in tokens]

#### Tip: look at some stats (most common) with Counter

In [13]:
full_list = []
for row_text in list_of_text:
    row_tokens = clean_and_tokenize_row_text(row_text)
    full_list.extend(row_tokens)

In [14]:
# If we want to see the most common
from collections import Counter

counter = Counter(full_list)
counter.most_common(10)

[('not', 5),
 ('either', 4),
 ('or', 4),
 ('the', 3),
 ('are', 3),
 ('many', 3),
 ('and', 2),
 ('we', 2),
 ('for', 2),
 ('of', 2)]

### Building the core of the solution

In [15]:
found_column_values = []
found_grey_words = []

for row_text in list_of_text:
    # tokenize once per row instead of once per grey word
    row_tokens = clean_and_tokenize_row_text(row_text)

    found_in_row = []
    for grey_word in grey_list:
        if grey_word in row_tokens:
            found_in_row.append(grey_word)

    found_grey_words.append("|".join(found_in_row))
    # the row is flagged as soon as at least one grey word was found
    found_column_values.append(bool(found_in_row))

In [16]:
result_df = df.assign(contains_grey_word=found_column_values, found_words=found_grey_words)
result_df

Unnamed: 0,0,1,2,3,4,5,6,contains_grey_word,found_words
0,henchmen,"<s>The Third Reich finally lowers its flag, Hi...",View,2020,2349222,IRE,NONE,True,flag|lowers
1,either,<s>The Spanish electoral system does not favor...,The Outlook,2021,3218937,UK,NONE,True,either
2,either,<s>Without knowing how many people we are talk...,The Outlook,2021,3999237,UK,NONE,True,either
3,either,<s>And electric trucks equipped for equipment ...,Euromag,2020,7182040,UK,NONE,True,either
4,either,<s>Many of our new customers since confinement...,The Tribune,2020,7790669,UK,NONE,True,either
5,or not,<s>Is it a big gap or not that much?</s>,Cinema Times,2020,8162203,UK,NONE,True,it
6,or not,<s>Each character has their own vision of what...,Cinema Times,2021,8384123,UK,NONE,False,
7,or not,<s>Many are for having the freedom to wear it ...,Gazette,2021,8923223,USA,NONE,True,it
8,or not,<s>We could find him a local accent or not.</s>,Euromag,2021,9401415,UK,NONE,False,
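
The per-row logic can also be wrapped in a small function, which makes it easier to try on single sentences. The sketch below is one possible variant, not the notebook's own code: `tokenize` repeats the cleaning step from earlier so the block runs on its own, and `find_grey_words` is a hypothetical name. Turning the tokens into a `set` keeps each membership check cheap.

```python
import re

def tokenize(row_text):
    # same cleaning as clean_and_tokenize_row_text: drop the tags, non-word chars -> spaces
    return [t.lower() for t in re.sub(r"\W", " ", row_text[3:-4]).split()]

def find_grey_words(row_text, grey_list):
    """Return the grey-list words present in the row, in grey-list order."""
    tokens = set(tokenize(row_text))
    return [word for word in grey_list if word in tokens]

grey_list = ["flag", "it", "either", "lowers"]
row = "<s>The Third Reich finally lowers its flag, Hitler and his henchmen are rendered harmless.</s>"

found = find_grey_words(row, grey_list)
print(bool(found), "|".join(found))  # True flag|lowers
```

An empty result list means the row gets `False` and an empty `found_words` value.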


### save the results to file

In [17]:
output_filepath = "bespoke_samples/greyword_search_result.tsv"
result_df.to_csv(output_filepath, sep="\t", index=False)

## How to append text to an existing file
Use the `mode="a"` parameter of `open`: it adds the new text at the end of the file instead of overwriting it.

In [18]:
with open("bespoke_samples/greyword_search_result.tsv", mode="a") as f:
    f.write("agjriogharoghrahgiarhgihreaghrheaighraeghargho")
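
Note that `write` does not add a line break on its own, so appended text lands right after whatever the file currently ends with. A minimal sketch of appending whole lines, run against a temporary file so it does not touch the real results; `append_line` is a hypothetical helper name.

```python
import os
import tempfile

def append_line(path, line):
    """Append one line to the file, making sure it ends with exactly one newline."""
    with open(path, mode="a", encoding="utf-8") as f:
        f.write(line.rstrip("\n") + "\n")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.tsv")

    # create a small file to append to
    with open(path, mode="w", encoding="utf-8") as f:
        f.write("a\tb\n")

    append_line(path, "c\td")
    append_line(path, "e\tf\n")

    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines)
```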