### 1. Data Understanding

In [1]:
import pandas as pd

data = pd.read_csv("afr.txt", delimiter='\t', header=None, names=['source', 'target', 'comments'])


In [2]:
data.head()

Unnamed: 0,source,target,comments
0,Come in.,Gaan binne.,CC-BY 2.0 (France) Attribution: tatoeba.org #3...
1,I'm full.,Ek is vol.,CC-BY 2.0 (France) Attribution: tatoeba.org #4...
2,She runs.,Sy hardloop.,CC-BY 2.0 (France) Attribution: tatoeba.org #6...
3,You lost.,Jy verloor.,CC-BY 2.0 (France) Attribution: tatoeba.org #6...
4,Go inside.,Gaan binne.,CC-BY 2.0 (France) Attribution: tatoeba.org #2...


In [3]:
# Check the columns, their non-null counts, and their data types
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 912 entries, 0 to 911
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   source    912 non-null    object
 1   target    912 non-null    object
 2   comments  912 non-null    object
dtypes: object(3)
memory usage: 21.5+ KB


In [10]:
# drop comments
data = data.drop(['comments'], axis=1)

In [11]:
#Check the shape of the dataset
data.shape

(912, 2)

In [12]:
# Count the distinct values in each column
data.nunique()

source    897
target    848
dtype: int64
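
Both unique counts fall short of the 912 rows, so some sentences occur more than once (in the preview above, `Gaan binne.` is the translation of both `Come in.` and `Go inside.`). A minimal sketch of how such repeats could be listed, using a small hypothetical `sample` frame in place of `data`:

```python
import pandas as pd

# Hypothetical stand-in for `data`; the real frame has 912 rows.
sample = pd.DataFrame({
    "source": ["Come in.", "Go inside.", "She runs.", "She runs."],
    "target": ["Gaan binne.", "Gaan binne.", "Sy hardloop.", "Sy hardloop."],
})

# Rows whose (source, target) pair already appeared earlier in the frame.
exact_repeats = sample[sample.duplicated()]

# Rows whose target is shared with at least one other row.
shared_targets = sample[sample.duplicated(subset="target", keep=False)]

print(len(exact_repeats))
print(len(shared_targets))
```

Whether repeated pairs should be dropped depends on what the data is used for later, so this only inspects them.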

### 2. Data Cleaning

Now I want to check whether any values are missing so that they can be filled in.

In [13]:
data.isnull().values.any()

False

In [43]:
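# Remove all periods. regex=False treats "." as a literal character rather than a regex wildcard.
# Note: `source` holds the English sentences, so the "Afrikaans" column ends up with English text.
data["Afrikaans"] = data["source"].str.replace(".", "", regex=False)

# Likewise, `target` holds the Afrikaans sentences, stored here under the name "English".
data["English"] = data["target"].str.replace(".", "", regex=False)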



In [44]:
# Remove all question marks from the "Afrikaans" column (literal match, not a regex)
data["Afrikaans"] = data["Afrikaans"].str.replace("?", "", regex=False)

# Remove all question marks from the "English" column (literal match, not a regex)
data["English"] = data["English"].str.replace("?", "", regex=False)

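
The two cells above remove one punctuation mark at a time. As a sketch of an alternative, a single translation table can delete several marks in one pass. The `MARKS` string and the `strip_marks` helper are illustrative names, not part of the notebook, and `MARKS` is limited to the two marks removed above; commas and apostrophes are left alone.

```python
# Characters to delete; kept to the two marks the notebook removes.
MARKS = ".?"
TABLE = str.maketrans("", "", MARKS)

def strip_marks(sentence: str) -> str:
    """Return the sentence with every character in MARKS removed."""
    return sentence.translate(TABLE)

examples = ["Come in.", "What are you doing, Dad?", "I'm full."]
cleaned = [strip_marks(s) for s in examples]
print(cleaned)
```

On a pandas column the same table could be applied with `data["source"].str.translate(TABLE)`.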


In [45]:
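# Preview the frame without the original columns; the result is not assigned back,
# so `data` itself still contains `source` and `target`.
data.drop(['source', 'target'], axis=1)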

Unnamed: 0,Afrikaans,English
0,Come in,Gaan binne
1,I'm full,Ek is vol
2,She runs,Sy hardloop
3,You lost,Jy verloor
4,Go inside,Gaan binne
...,...,...
907,A 5% consumption tax is levied on purchases of...,'n Verbruiksbelasting van 5% word gehef op die...
908,"Since you have a sore throat and a fever, you ...",Aangesien jy 'n seer keel en koors het moet jy...
909,"In the past, the boys were taught to fend for ...",In die verlede was die seuns geleer om vir hul...
910,These three hours of driving have worn me out ...,Hierdie drie ure van bestuur het my uitgeput K...


### Data Preparation

Now I'll tokenize the words and count how often they appear in this data. 

In [54]:
import nltk
from collections import Counter

# Tokenize the sentences in the "Afrikaans" and "English" columns
# (nltk.word_tokenize needs NLTK's Punkt tokenizer data to be downloaded first)
Afrikaans_sentences = [nltk.word_tokenize(sentence.lower()) for sentence in data['Afrikaans']]
English_sentences = [nltk.word_tokenize(sentence.lower()) for sentence in data['English']]

# Count the frequency of each word in the tokenized sentences
Afrikaans_word_counts = Counter([word for sentence in Afrikaans_sentences for word in sentence])
English_word_counts = Counter([word for sentence in English_sentences for word in sentence])

# Take the ten most frequent words from each column
most_common_source_words = Afrikaans_word_counts.most_common(10)
most_common_target_words = English_word_counts.most_common(10)

print("Most common words in 'source' column:")
print(most_common_source_words)

print("Most common words in 'target' column:")
print(most_common_target_words)


Most common words in 'source' column:
[('i', 333), ('tom', 223), ('you', 219), ("n't", 210), ('to', 158), ('the', 131), ('be', 125), ('a', 109), ('is', 104), ('do', 86)]
Most common words in 'target' column:
[('nie', 456), ('ek', 323), ('tom', 222), ('het', 201), ('is', 193), ('die', 153), ('jy', 150), ("'n", 109), ('dit', 105), ('gaan', 91)]


The words 'i' and 'tom' appear most often in the English sentences (333 and 223 occurrences). <br>
The words 'nie' and 'ek' appear most often in the Afrikaans sentences (456 and 323 occurrences).
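
The `n't` entry in the English list comes from `nltk.word_tokenize`, which splits contractions such as `don't` into `do` and `n't`. As a rough sketch of the counting step alone, the snippet below uses plain whitespace splitting instead of NLTK (so contractions stay whole) on a few hypothetical sentences:

```python
from collections import Counter

# Hypothetical sample sentences; the notebook uses the full columns.
sentences = ["I don't know", "I am full", "Tom is full"]

# Lowercase and split on whitespace, then flatten into one token stream.
tokens = [word for s in sentences for word in s.lower().split()]

# Count each token and keep the two most frequent.
counts = Counter(tokens)
top_two = counts.most_common(2)
print(top_two)
```

Whitespace splitting is cruder than NLTK's tokenizer (punctuation stays attached to words), so it is only meant to show how `Counter` and `most_common` fit together.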

In [47]:
data.dtypes

source       object
target       object
Afrikaans    object
English      object
dtype: object

In [53]:
# Flag rows in which any column contains a comma
has_commas = data.apply(lambda row: row.str.contains(",").any(), axis=1)

# Flag rows in which any column contains an exclamation mark
has_excMark = data.apply(lambda row: row.str.contains("!").any(), axis=1)

# Print the rows that contain commas, then the rows that contain exclamation marks
print(data[has_commas])
print(data[has_excMark])


                                                source  \
117                                  The bill, please.   
170                                 The check, please.   
224                                Now, don't be late.   
257                               You can go now, sir.   
392                           What are you doing, Dad?   
477                        I'd like one stamp, please.   
601                     You don't know my dad, do you?   
617                    This, however, is not possible.   
619                    We'll have a problem, won't we?   
620                    Will you pay attention, please?   
643                   Would you move your car, please?   
644                   Would you move your car, please?   
672                 He invested 500,000 yen in stocks.   
684                 Naturally, that's why we are here.   
727                You can use my car, if you want to.   
755              I've told you that before, haven't I?   
775           