# 🔷 PART 1: Exploratory Data Analysis 🔷

In this Jupyter notebook, we take a **basic but comprehensive** look at our external datasets: we manipulate, curate, and prepare the data so that we can ask critical questions and understand what later, prediction-driven processing will require.

---

## 🔵 TABLE OF CONTENTS 🔵 <a name="TOC"></a>

Use this **table of contents** to navigate the various sections of the preprocessing notebook.

#### 1. [Section A: Imports and Initializations](#section-A)

    All necessary imports and object instantiations for data preprocessing.

#### 2. [Section B: Manipulating Our Data](#section-B)

    Data manipulation operations, including (but not limited to) 
    null value imputation and data cleaning. 

#### 3. [Section C: Visualizing Trends Across Our Data](#section-C)

    Data visualizations to outline trends and patterns 
    inherent across our data that may mandate further analysis.

#### 4. [Section D: Saving Our Interim Datasets](#section-D)

    Saving preprocessed data states for further access.

#### 5. [Appendix: Supplementary Custom Objects](#appendix)

    Custom object architectures used throughout the data preprocessing.
    
---

## 🔹 Section A: Imports and Initializations <a name="section-A"></a>

#### General Imports for Data Manipulation and Visualization.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import string
import re
import unicodedata
from unicodedata import normalize
import nltk

#### Custom Algorithmic Structures for Processed Data Visualization.

In [2]:
import sys
sys.path.append("../source/structures")

# TODO: Place custom structures from `../source/structures` here.

##### [(back to top)](#TOC)

---

## 🔹 Section B: Manipulating Our Data <a name="section-B"></a>

In [3]:
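# Read a UTF-8 text file and return its full contents as one string.
def get_text(file):
    # Open the file in text mode; the context manager closes it on exit.
    with open(file, mode="rt", encoding="utf-8") as f:
        # Read the whole file into memory.
        data = f.read()
    # Return the text from the file.
    return data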

I decided to test my function on a file of French-English bilingual pairs. After loading the file, I used string slicing to preview the first portion of the text.

A few observations:

- Each line holds an English sentence followed by its French translation.
- Fields are separated by a tab delimiter. The count printed below (341303) is the number of tab characters plus one, not the number of sentence pairs; if every line holds two tabs, as the preview suggests, the file has roughly 170,000 lines.
- Each line ends with an attribution field from tatoeba.org, which includes non-alphabetic characters and usernames.
- The same English sentence can have multiple French translations (for example, "Hi." appears with both "Salut !" and "Salut.").
- Each English sentence is punctuated, which may be important to its meaning in French.
- The file starts with very simple sequences, and there are more complex ones at the end. (I might consider reducing the data for a simpler model.)
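These observations can be checked programmatically rather than by eye. The following is a minimal sketch, assuming every line has the three tab-separated fields seen in the preview; `sample` is a hypothetical stand-in for the loaded file text and `summarize` is an illustrative helper, not part of this notebook.

```python
from collections import defaultdict

# Hypothetical stand-in for the text returned by get_text().
sample = (
    "Go.\tVa !\tattribution-1\n"
    "Hi.\tSalut !\tattribution-2\n"
    "Hi.\tSalut.\tattribution-3\n"
)

def summarize(txt):
    """Count lines and tabs, and find English sentences with several translations."""
    lines = txt.strip().split("\n")
    translations = defaultdict(set)
    for line in lines:
        fields = line.split("\t")
        # Assumption: field 0 is English and field 1 is the translation.
        translations[fields[0]].add(fields[1])
    multi = {en: fr for en, fr in translations.items() if len(fr) > 1}
    return len(lines), txt.count("\t"), multi

n_lines, n_tabs, multi = summarize(sample)
print(n_lines, n_tabs, sorted(multi))
```

Counting lines directly gives the number of sentence pairs, while the tab count is twice that when each line carries an attribution field.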

In [4]:
# previewing file
text = get_text('../datasets/external/fra.txt')
# print the number of tab characters plus one (each line holds two tabs, so this is not the pair count)
print(text.count('\t') + 1)
# print the first 2000 characters of text
print(text[:2000])
# print a later slice of text (characters 490000 to 500000)
print(text[490000:500000])

341303
Go.	Va !	CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #1158250 (Wittydev)
Hi.	Salut !	CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #509819 (Aiji)
Hi.	Salut.	CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #4320462 (gillux)
Run!	Cours !	CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906331 (sacredceltic)
Run!	Courez !	CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906332 (sacredceltic)
Who?	Qui ?	CC-BY 2.0 (France) Attribution: tatoeba.org #2083030 (CK) & #4366796 (gillux)
Wow!	Ça alors !	CC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #374631 (zmoo)
Fire!	Au feu !	CC-BY 2.0 (France) Attribution: tatoeba.org #1829639 (Spamster) & #4627939 (sacredceltic)
Help!	À l'aide !	CC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #128430 (sysko)
Jump.	Saute.	CC-BY 2.0 (France) Attribution: tatoeba.org #631038 (Shishir) & #2416938 (Phoenix)
Stop!	Ça suffit !	CC-BY 2.0 (France
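Slicing by character position can cut a line in half, as the truncated preview shows. One alternative is to preview whole lines instead. This is only a sketch: `preview_lines` is a hypothetical helper and `sample` stands in for the loaded file text.

```python
def preview_lines(txt, n=3):
    """Return the first and last n complete lines of txt."""
    lines = txt.strip().split("\n")
    return lines[:n], lines[-n:]

# Hypothetical stand-in for the loaded file text: ten tab-separated lines.
sample = "\n".join(f"en {i}\tfr {i}\tattribution {i}" for i in range(10))

head, tail = preview_lines(sample, n=2)
print(head)
print(tail)
```

Because the helper splits on newlines first, both previews always end on a line boundary regardless of how long individual lines are.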

### Preprocessing Data

#### Converting Lines to Sentence Pairs

In [5]:
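def to_pairs(txt):
    # Split the text into lines, then split each line on tabs.
    # Each resulting list holds [English, French, attribution].
    lines = txt.strip().split('\n')
    pairs = [line.split('\t') for line in lines]

    return pairs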

In [6]:
pairs = to_pairs(text)
print(pairs[0:50])

[['Go.', 'Va !', 'CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #1158250 (Wittydev)'], ['Hi.', 'Salut !', 'CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #509819 (Aiji)'], ['Hi.', 'Salut.', 'CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #4320462 (gillux)'], ['Run!', 'Cours\u202f!', 'CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906331 (sacredceltic)'], ['Run!', 'Courez\u202f!', 'CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906332 (sacredceltic)'], ['Who?', 'Qui ?', 'CC-BY 2.0 (France) Attribution: tatoeba.org #2083030 (CK) & #4366796 (gillux)'], ['Wow!', 'Ça alors\u202f!', 'CC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #374631 (zmoo)'], ['Fire!', 'Au feu !', 'CC-BY 2.0 (France) Attribution: tatoeba.org #1829639 (Spamster) & #4627939 (sacredceltic)'], ['Help!', "À l'aide\u202f!", 'CC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #128430 (sysko)'], ['Jump.', 'Saute.', 'CC-BY 2
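The printed lists still carry the attribution field, and some French strings contain the narrow no-break space `\u202f`. A possible next step, sketched here under the assumption that only the first two fields are needed, is to drop the attribution and normalize each string with `unicodedata.normalize`. `clean_pairs`, `max_pairs`, and `sample_pairs` are hypothetical names used for illustration only.

```python
import unicodedata

def clean_pairs(pairs, max_pairs=None):
    """Keep the (English, French) fields and normalize compatibility characters."""
    cleaned = []
    # Slicing with None keeps every row; an integer keeps only the first rows.
    for fields in pairs[:max_pairs]:
        # Assumption: the first two fields are the sentence pair; the rest is attribution.
        en, fr = fields[0], fields[1]
        # NFKC maps compatibility characters such as U+202F to a plain space.
        cleaned.append([unicodedata.normalize("NFKC", s) for s in (en, fr)])
    return cleaned

# Hypothetical sample mirroring the structure of the printed pairs.
sample_pairs = [
    ["Run!", "Cours\u202f!", "attribution-1"],
    ["Wow!", "\u00c7a alors\u202f!", "attribution-2"],
]
print(clean_pairs(sample_pairs))
```

The optional `max_pairs` argument is one way to reduce the data to the simpler sequences at the start of the file, as considered in the observations.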

##### [(back to top)](#TOC)

---

## 🔹 Section C: Visualizing Trends Across Our Data <a name="section-C"></a>

##### [(back to top)](#TOC)

---

## 🔹 Section D: Saving Our Interim Datasets <a name="section-D"></a>

##### [(back to top)](#TOC)

---

## 🔹 Appendix: Supplementary Custom Objects <a name="appendix"></a>

##### [(back to top)](#TOC)

---