# 🔷 PART 1: Exploratory Data Analysis 🔷

In this Jupyter notebook, we take a **basic but comprehensive** look at our external datasets: we manipulate, curate, and prepare the data in order to ask critical questions and understand how to approach the higher-level, prediction-driven data modification that follows.

---

## 🔵 TABLE OF CONTENTS 🔵 <a name="TOC"></a>

Use this **table of contents** to navigate the various sections of the preprocessing notebook.

#### 1. [Section A: Imports and Initializations](#section-A)

    All necessary imports and object instantiations for data preprocessing.

#### 2. [Section B: Manipulating Our Data](#section-B)

    Data manipulation operations, including (but not limited to) 
    null value imputation and data cleaning. 

#### 3. [Section C: Visualizing Trends Across Our Data](#section-C)

    Data visualizations to outline trends and patterns 
    inherent across our data that may mandate further analysis.

#### 4. [Section D: Saving Our Interim Datasets](#section-D)

    Saving preprocessed data states for further access.

#### 5. [Appendix: Supplementary Custom Objects](#appendix)

    Custom object architectures used throughout the data preprocessing.
    
---

## 🔹 Section A: Imports and Initializations <a name="section-A"></a>

#### General Imports for Data Manipulation and Visualization.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import string
import re
import unicodedata
from unicodedata import normalize
import nltk

#### Custom Algorithmic Structures for Processed Data Visualization.

In [2]:
import sys
sys.path.append("../source/structures")

# TODO: Place custom structures from `../source/structures` here.

##### [(back to top)](#TOC)

---

## 🔹 Section B: Manipulating Our Data <a name="section-B"></a>

In [3]:
#function to read text file
def get_text(file):
    #load in the file
    with open(file, mode="rt", encoding="utf-8") as f:
        #read the file and return its text
        return f.read()

I decided to test my function on a file of French-English bilingual pairs. After loading the file, I used string slicing to preview the first few lines of text.

A few observations:

- Below I see English text alongside its French translation.
- Each line holds three tab-delimited fields. The count printed below (341303) is the number of tab characters plus one, not the number of sentence pairs.
- Each line ends with an attribution for the translation from tatoeba.org, which includes non-alphabetic characters and usernames.
- There are multiple French translations for the same English sentence.
- Each English sentence is punctuated, which may matter to the meaning of its French translation.
- The file starts with very simple sequences, and more complex ones appear toward the end. (I might consider reducing the data for a simpler model.)
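
The observation about multiple translations per English sentence can be checked directly. This is a minimal sketch, assuming each line has the form `english<TAB>french<TAB>attribution`. The helper name `group_translations` and the `sample` string are illustrative and not part of the notebook.

```python
from collections import defaultdict

def group_translations(raw_text):
    """Map each English sentence to the list of French translations seen for it."""
    grouped = defaultdict(list)
    for line in raw_text.strip().split('\n'):
        fields = line.split('\t')
        # Skip malformed lines that lack an English and a French field.
        if len(fields) < 2:
            continue
        english, french = fields[0], fields[1]
        grouped[english].append(french)
    return dict(grouped)

# Hypothetical three-line sample laid out like the file.
sample = (
    "Hi.\tSalut !\tattribution-1\n"
    "Hi.\tSalut.\tattribution-2\n"
    "Run!\tCours !\tattribution-3"
)
grouped = group_translations(sample)
print(len(grouped))    # number of distinct English sentences
print(grouped["Hi."])  # every French translation recorded for "Hi."
```

Applied to the full file, comparing `len(grouped)` with the number of lines would show how much of the corpus consists of alternative translations.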

In [4]:
#previewing file
text = get_text('../datasets/external/fra.txt')
#print number of tab-separated fields (tab count + 1)
print(text.count('\t') +1)
#print first few lines of text
print(text[:2000])
#print last few lines of text
print(text[-2000:])

341303
Go.	Va !	CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #1158250 (Wittydev)
Hi.	Salut !	CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #509819 (Aiji)
Hi.	Salut.	CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #4320462 (gillux)
Run!	Cours !	CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906331 (sacredceltic)
Run!	Courez !	CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906332 (sacredceltic)
Who?	Qui ?	CC-BY 2.0 (France) Attribution: tatoeba.org #2083030 (CK) & #4366796 (gillux)
Wow!	Ça alors !	CC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #374631 (zmoo)
Fire!	Au feu !	CC-BY 2.0 (France) Attribution: tatoeba.org #1829639 (Spamster) & #4627939 (sacredceltic)
Help!	À l'aide !	CC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #128430 (sysko)
Jump.	Saute.	CC-BY 2.0 (France) Attribution: tatoeba.org #631038 (Shishir) & #2416938 (Phoenix)
Stop!	Ça suffit !	CC-BY 2.0 (France

### Preprocessing Data

#### Converting Lines to Sentence Pairs

In [5]:
def to_pairs(txt):
    #split the text into lines, then each line into its tab-separated fields
    lines = txt.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    return pairs
 

In [6]:
pairs = to_pairs(text)
print(pairs[0:50])

[['Go.', 'Va !', 'CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #1158250 (Wittydev)'], ['Hi.', 'Salut !', 'CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #509819 (Aiji)'], ['Hi.', 'Salut.', 'CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #4320462 (gillux)'], ['Run!', 'Cours\u202f!', 'CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906331 (sacredceltic)'], ['Run!', 'Courez\u202f!', 'CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906332 (sacredceltic)'], ['Who?', 'Qui ?', 'CC-BY 2.0 (France) Attribution: tatoeba.org #2083030 (CK) & #4366796 (gillux)'], ['Wow!', 'Ça alors\u202f!', 'CC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #374631 (zmoo)'], ['Fire!', 'Au feu !', 'CC-BY 2.0 (France) Attribution: tatoeba.org #1829639 (Spamster) & #4627939 (sacredceltic)'], ['Help!', "À l'aide\u202f!", 'CC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #128430 (sysko)'], ['Jump.', 'Saute.', 'CC-BY 2
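
Removing the attribution by position assumes that every line splits into exactly three fields. This is a minimal sketch of a sanity check, assuming `pairs` has the shape returned by `to_pairs`. The helper name `field_counts` and the sample rows are illustrative only.

```python
from collections import Counter

def field_counts(pairs):
    """Count how many lines have each number of tab-separated fields."""
    return Counter(len(fields) for fields in pairs)

# Hypothetical sample: two well-formed lines and one missing its attribution.
sample_pairs = [
    ['Go.', 'Va !', 'attribution-1'],
    ['Hi.', 'Salut !', 'attribution-2'],
    ['Run!', 'Cours !'],
]
counts = field_counts(sample_pairs)
print(counts)

# Lines that do not have three fields deserve a closer look before cleaning.
malformed = [fields for fields in sample_pairs if len(fields) != 3]
print(malformed)
```

If `field_counts(pairs)` reports anything other than a single key of `3` on the real data, dropping the last item of every line would silently remove a translation rather than an attribution.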

#### Remove Unnecessary Text from String

In [7]:
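def remove_chars(txt):
    #keep only the English and French fields, dropping the trailing attribution
    #(builds a new list, so re-running the cell cannot remove more fields)
    return [line[:2] for line in txt]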

In [8]:
no_chars = remove_chars(pairs)

print(no_chars[0:100])

[['Go.', 'Va !'], ['Hi.', 'Salut !'], ['Hi.', 'Salut.'], ['Run!', 'Cours\u202f!'], ['Run!', 'Courez\u202f!'], ['Who?', 'Qui ?'], ['Wow!', 'Ça alors\u202f!'], ['Fire!', 'Au feu !'], ['Help!', "À l'aide\u202f!"], ['Jump.', 'Saute.'], ['Stop!', 'Ça suffit\u202f!'], ['Stop!', 'Stop\u202f!'], ['Stop!', 'Arrête-toi !'], ['Wait!', 'Attends !'], ['Wait!', 'Attendez !'], ['Go on.', 'Poursuis.'], ['Go on.', 'Continuez.'], ['Go on.', 'Poursuivez.'], ['Hello!', 'Bonjour !'], ['Hello!', 'Salut !'], ['I see.', 'Je comprends.'], ['I try.', "J'essaye."], ['I won!', "J'ai gagné !"], ['I won!', "Je l'ai emporté !"], ['I won.', 'J’ai gagné.'], ['Oh no!', 'Oh non !'], ['Attack!', 'Attaque !'], ['Attack!', 'Attaquez !'], ['Cheers!', 'Santé !'], ['Cheers!', 'À votre santé !'], ['Cheers!', 'Merci !'], ['Cheers!', 'Tchin-tchin !'], ['Get up.', 'Lève-toi.'], ['Go now.', 'Va, maintenant.'], ['Go now.', 'Allez-y maintenant.'], ['Go now.', 'Vas-y maintenant.'], ['Got it!', "J'ai pigé !"], ['Got it!'

In [None]:
def to

In [33]:
#normalize unicode
def norma_uni(txt):
    #build new pairs: strings are immutable, so reassigning the loop variable changes nothing
    return [[normalize('NFD', line).encode('ascii', 'ignore').decode('UTF-8') for line in pair]
            for pair in txt]

In [34]:
norma = norma_uni(no_chars)

print(norma[0:100])

[['Go.', 'Va !'], ['Hi.', 'Salut !'], ['Hi.', 'Salut.'], ['Run!', 'Cours!'], ['Run!', 'Courez!'], ['Who?', 'Qui ?'], ['Wow!', 'Ca alors!'], ['Fire!', 'Au feu !'], ['Help!', "A l'aide!"], ['Jump.', 'Saute.'], ['Stop!', 'Ca suffit!'], ['Stop!', 'Stop!'], ['Stop!', 'Arrete-toi !'], ['Wait!', 'Attends !'], ['Wait!', 'Attendez !'], ['Go on.', 'Poursuis.'], ['Go on.', 'Continuez.'], ['Go on.', 'Poursuivez.'], ['Hello!', 'Bonjour !'], ['Hello!', 'Salut !'], ['I see.', 'Je comprends.'], ['I try.', "J'essaye."], ['I won!', "J'ai gagne !"], ['I won!', "Je l'ai emporte !"], ['I won.', 'Jai gagne.'], ['Oh no!', 'Oh non !'], ['Attack!', 'Attaque !'], ['Attack!', 'Attaquez !'], ['Cheers!', 'Sante !'], ['Cheers!', 'A votre sante !'], ['Cheers!', 'Merci !'], ['Cheers!', 'Tchin-tchin !'], ['Get up.', 'Leve-toi.'], ['Go now.', 'Va, maintenant.'], ['Go now.', 'Allez-y maintenant.'], ['Go now.', 'Vas-y maintenant.'], ['Got it!', "J'ai pige !"], ['Got it!'

In [46]:
#lowercase characters
def lower_case(txt):
    #return new pairs with every sentence lowercased
    return [[line.lower() for line in pair] for pair in txt]

In [47]:
lower = lower_case(no_chars)
print(lower[0:100])

[['go.', 'va !'], ['hi.', 'salut !'], ['hi.', 'salut.'], ['run!', 'cours\u202f!'], ['run!', 'courez\u202f!'], ['who?', 'qui ?'], ['wow!', 'ça alors\u202f!'], ['fire!', 'au feu !'], ['help!', "à l'aide\u202f!"], ['jump.', 'saute.'], ['stop!', 'ça suffit\u202f!'], ['stop!', 'stop\u202f!'], ['stop!', 'arrête-toi !'], ['wait!', 'attends !'], ['wait!', 'attendez !'], ['go on.', 'poursuis.'], ['go on.', 'continuez.'], ['go on.', 'poursuivez.'], ['hello!', 'bonjour !'], ['hello!', 'salut !'], ['i see.', 'je comprends.'], ['i try.', "j'essaye."], ['i won!', "j'ai gagné !"], ['i won!', "je l'ai emporté !"], ['i won.', 'j’ai gagné.'], ['oh no!', 'oh non !'], ['attack!', 'attaque !'], ['attack!', 'attaquez !'], ['cheers!', 'santé !'], ['cheers!', 'à votre santé !'], ['cheers!', 'merci !'], ['cheers!', 'tchin-tchin !'], ['get up.', 'lève-toi.'], ['go now.', 'va, maintenant.'], ['go now.', 'allez-y maintenant.'], ['go now.', 'vas-y maintenant.'], ['got it!', "j'ai pigé !"], ['got it!'

In [24]:
table = str.maketrans('', '', string.punctuation)

#remove punctuation
def remove_punct(corpus):
    #translate each whole sentence and return new pairs
    return [[line.translate(table) for line in pair] for pair in corpus]

In [25]:
stripped = remove_punct(norma)
print(stripped[0:100])

[['Go', 'Va '], ['Hi', 'Salut '], ['Hi', 'Salut'], ['Run', 'Cours'], ['Run', 'Courez'], ['Who', 'Qui '], ['Wow', 'Ca alors'], ['Fire', 'Au feu '], ['Help', 'A laide'], ['Jump', 'Saute'], ['Stop', 'Ca suffit'], ['Stop', 'Stop'], ['Stop', 'Arretetoi '], ['Wait', 'Attends '], ['Wait', 'Attendez '], ['Go on', 'Poursuis'], ['Go on', 'Continuez'], ['Go on', 'Poursuivez'], ['Hello', 'Bonjour '], ['Hello', 'Salut '], ['I see', 'Je comprends'], ['I try', 'Jessaye'], ['I won', 'Jai gagne '], ['I won', 'Je lai emporte '], ['I won', 'Jai gagne'], ['Oh no', 'Oh non '], ['Attack', 'Attaque '], ['Attack', 'Attaquez '], ['Cheers', 'Sante '], ['Cheers', 'A votre sante '], ['Cheers', 'Merci '], ['Cheers', 'Tchintchin '], ['Get up', 'Levetoi'], ['Go now', 'Va maintenant'], ['Go now', 'Allezy maintenant'], ['Go now', 'Vasy maintenant'], ['Got it', 'Jai pige '], ['Got it'
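
The narrow no-break space (`\u202f`) that precedes some French punctuation has no ASCII equivalent, so what happens to it depends on the normalization form. This is a minimal sketch comparing `NFD` with `NFKD`, assuming a hypothetical helper `to_ascii` that is not part of the notebook.

```python
from unicodedata import normalize

def to_ascii(sentence, form='NFD'):
    """Normalize with the given Unicode form, then drop anything that is not ASCII."""
    return normalize(form, sentence).encode('ascii', 'ignore').decode('ascii')

sentence = 'Ça alors\u202f!'

# NFD only splits accented letters into a base letter plus a combining mark,
# so U+202F survives normalization and is then discarded by the ASCII encode.
nfd_result = to_ascii(sentence, 'NFD')
print(repr(nfd_result))

# NFKD also applies compatibility mappings, which turn U+202F into a plain space.
nfkd_result = to_ascii(sentence, 'NFKD')
print(repr(nfkd_result))
```

With `NFD` the space before the exclamation mark disappears, while with `NFKD` it becomes an ordinary space. Which behavior is preferable depends on whether punctuation is later split off as its own token.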

In [13]:
# Get all unicode code points (sys.maxunicode is itself a valid code point)
all_chars = (chr(i) for i in range(sys.maxunicode + 1))
# Get all control characters (Unicode category 'Cc')
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) == 'Cc')
# Create regex of above characters
control_char_re = re.compile('[%s]' % re.escape(control_chars))

#remove control chars from a line
#(named so it is not shadowed by the `re_print` regex defined in the next cell)
def remove_control_chars(line):
    return control_char_re.sub('', str(line))
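
The same filter can be written without building a character class from every code point. This is a minimal sketch, assuming a hypothetical helper `drop_control_chars` (not part of the notebook) that relies only on `unicodedata.category`.

```python
import unicodedata

def drop_control_chars(line):
    """Remove every character whose Unicode category is 'Cc' (control)."""
    return ''.join(c for c in line if unicodedata.category(c) != 'Cc')

# NUL and BEL are control characters; the narrow no-break space is not.
dirty = 'Go.\x00\x07 Va\u202f!'
cleaned_line = drop_control_chars(dirty)
print(repr(cleaned_line))
```

Tab and newline also belong to category `Cc`, so a filter like this is best applied after the text has been split into lines and fields.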

#### Cleaning List of Lines

In [15]:
#regex for char filtering
re_print = re.compile('[^%s]' % re.escape(string.printable))

#clean list of lines
def clean_txt(pairs):

    cleaned = []

    for pair in pairs:
        clean_pair = []
        for line in pair:
            #normalize unicode and drop non-ASCII characters
            line = normalize('NFD', line).encode('ascii', 'ignore').decode('UTF-8')
            #split by white space
            tokens = line.split()
            #lowercase
            tokens = [w.lower() for w in tokens]
            #remove punctuation from each token (uses `table` defined above)
            tokens = [w.translate(table) for w in tokens]
            #remove non-printable chars from each token
            tokens = [re_print.sub('', w) for w in tokens]
            #keep only alphabetic tokens
            tokens = [w for w in tokens if w.isalpha()]
            #store as string
            clean_pair.append(" ".join(tokens))

        cleaned.append(clean_pair)

    return cleaned

In [16]:
text = clean_txt(no_chars)
print(text[0:500])
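
The individual cleaning steps can be checked on a single sentence before running them over the whole corpus. This is a minimal, self-contained sketch. The helper name `clean_line` is hypothetical, and the step order (normalize, tokenize, lowercase, strip punctuation, strip non-printable characters, keep alphabetic tokens) follows the comments in `clean_txt`.

```python
import re
import string
from unicodedata import normalize

# Translation table that deletes ASCII punctuation.
punct_table = str.maketrans('', '', string.punctuation)
# Regex matching any character outside string.printable.
non_printable = re.compile('[^%s]' % re.escape(string.printable))

def clean_line(line):
    """Apply the cleaning steps to one sentence and return the cleaned string."""
    # Decompose accents, then drop every non-ASCII character.
    line = normalize('NFD', line).encode('ascii', 'ignore').decode('ascii')
    # Split on whitespace and lowercase each token.
    tokens = [w.lower() for w in line.split()]
    # Remove punctuation and non-printable characters from each token.
    tokens = [w.translate(punct_table) for w in tokens]
    tokens = [non_printable.sub('', w) for w in tokens]
    # Keep only tokens made entirely of letters (this also drops empty tokens).
    tokens = [w for w in tokens if w.isalpha()]
    return ' '.join(tokens)

for sentence in ['Go.', 'Va !', "À l'aide\u202f!"]:
    print(repr(clean_line(sentence)))
```

Stripping punctuation joins elided words (`l'aide` becomes `laide`), which is worth keeping in mind if the French vocabulary is inspected later.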

##### [(back to top)](#TOC)

---

## 🔹 Section C: Visualizing Trends Across Our Data <a name="section-C"></a>

##### [(back to top)](#TOC)

---

## 🔹 Section D: Saving Our Interim Datasets <a name="section-D"></a>

##### [(back to top)](#TOC)

---

## 🔹 Appendix: Supplementary Custom Objects <a name="appendix"></a>

##### [(back to top)](#TOC)

---