<h2>MapReduce</h2>

Reset all variables before running the notebook

In [11]:
%reset -f

<h3>Importing libraries</h3>

In [12]:
import pandas as pd
from collections import Counter
from datetime import datetime
import os
import time

Set this option to avoid scientific notation when displaying pandas DataFrames

In [13]:
pd.options.display.float_format = '{:.2f}'.format
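
As a small illustration of what this display option changes, the sketch below formats a made-up Series with and without the two-decimal formatter. The sample values are assumptions chosen only for the demonstration, and `pd.option_context` is used so the setting is applied temporarily rather than globally.

```python
import pandas as pd

# Hypothetical values, picked so that the default display may fall back
# to scientific notation.
demo = pd.Series([1e10, 2.5e-7])

# Default display.
print(demo)

# Same data with the fixed two-decimal formatter applied temporarily.
with pd.option_context('display.float_format', '{:.2f}'.format):
    formatted = str(demo)

print(formatted)
```

Note that the formatter only affects how values are displayed; the stored values are unchanged.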

<h3>Reading the file and creating the DataFrame</h3>

Start timer

In [14]:
start_time = time.time()

In [15]:
df = pd.read_csv('transformed_data/output.csv', sep=';', dtype=str, encoding='utf-8')

In [16]:
df.head()

Unnamed: 0,Id,Title,review/helpfulness,review/score,review/time,review/summary,review/text
0,1882931173,Its Only Art If Its Well Hung!,7/7,4.0,940636800,Nice collection of Julie Strain images,This is only for Julie Strain fans. It's a col...
1,826414346,Dr. Seuss: American Icon,10/10,5.0,1095724800,Really Enjoyed It,I don't care much for Dr. Seuss but after read...
2,826414346,Dr. Seuss: American Icon,10/11,5.0,1078790400,Essential for every personal and Public Library,"If people become the books they read and if ""t..."
3,826414346,Dr. Seuss: American Icon,7/7,4.0,1090713600,Phlip Nel gives silly Seuss a serious treatment,"Theodore Seuss Geisel (1904-1991), aka &quot;D..."
4,826414346,Dr. Seuss: American Icon,3/3,4.0,1107993600,Good academic overview,Philip Nel - Dr. Seuss: American IconThis is b...


<h3>Listing all words</h3>

Creating a single flat list with the content of all cells, keeping only purely alphabetic entries

In [17]:
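cells_content = []
for column in df.columns:
    # Split each cell on ';' and flatten the pieces into one list.
    cells_content += df[column].astype('str').str.split(';').explode().to_list()

# Keep only purely alphabetic entries. str.isalpha() is False for strings that
# contain spaces, digits, or punctuation, so multi-word cells are dropped here.
cells_content = [word.strip() for word in cells_content if word.strip().isalpha()]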
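
To make the effect of the `isalpha()` filter concrete, here is a minimal sketch on made-up cell values (the sample strings are assumptions, not rows taken from the dataset). Since `str.isalpha()` is false for any string containing spaces, digits, or punctuation, only cells made of one purely alphabetic token pass. The text `nan`, which is what `astype('str')` produces for a missing value, passes as well.

```python
# Hypothetical cell values, chosen only to illustrate the filter.
sample_cells = ["Excellent", " Holes ", "Really Enjoyed It", "4.0", "7/7", "nan"]

# Same filter as in the cell above: strip whitespace, then keep only
# strings that are purely alphabetic.
kept = [cell.strip() for cell in sample_cells if cell.strip().isalpha()]

print(kept)
```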

Deleting the df dataframe to release memory

In [18]:
del df

Filtering and splitting cells with multiple words

In [19]:
# Split each cell on whitespace and collect every resulting word.
word_list = [word for content in cells_content for word in content.split()]

In [20]:
del cells_content

<h3>Creating log dataframe</h3>

In [21]:
header = ['date and time',
          'program',
          'Execution time (s)', 
          'Qty of words', 
          'Qty of repeated words', 
          'Qty of non repeated words', 
          '1ª most repeated word', 
          '1ª most repeated word (Freq)',
          '2ª most repeated word', 
          '2ª most repeated word (Freq)',
          '3ª most repeated word',
          '3ª most repeated word (Freq)',]
log_df = pd.DataFrame(columns=header)

<h3>Counting words</h3>

In [22]:
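counted_words = Counter(word_list)
# most_common() returns (word, count) pairs sorted by descending count.
most_common = counted_words.most_common()
# Words seen more than once versus words seen exactly once.
repeated_words = [pair for pair in most_common if pair[1] > 1]
non_repeated_words = [pair for pair in most_common if pair[1] == 1]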
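
The notebook counts words with `Counter` in a single call. For comparison, the sketch below shows how the same count could be written as explicit map and reduce phases. It is an illustration under stated assumptions, not the method used above: the word list and the helper `add_pair` are hypothetical.

```python
from collections import Counter
from functools import reduce

# Hypothetical word list standing in for word_list.
words = ["Holes", "Excellent", "Holes", "Persuasion", "Excellent", "Holes"]

# Map phase: emit a (word, 1) pair for every word.
mapped = [(word, 1) for word in words]


def add_pair(totals, pair):
    """Reduce step: add one emitted count to the running total for its word."""
    word, count = pair
    totals[word] = totals.get(word, 0) + count
    return totals


# Reduce phase: fold all emitted pairs into one dictionary of totals.
reduced = reduce(add_pair, mapped, {})

# The result agrees with what Counter computes directly.
assert reduced == dict(Counter(words))
print(reduced)
```

On a single machine `Counter` is the simpler choice; the explicit phases only matter when the map and reduce steps are distributed across workers.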

Stop timer

In [23]:
stop_time = time.time()
execution_time = stop_time - start_time

# Convert the elapsed time to hours, minutes, and seconds
hours, remainder = divmod(execution_time, 3600)
minutes, seconds = divmod(remainder, 60)
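
The `divmod` conversion above can be wrapped in a small helper that returns a readable string. This is a sketch only: `format_elapsed` is a hypothetical name, and the sample value is the execution time recorded in the log output of this notebook.

```python
def format_elapsed(total_seconds):
    """Return an elapsed time in seconds as an HH:MM:SS.mmm string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{seconds:06.3f}"


print(format_elapsed(98.581))  # prints 00:01:38.581
```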

Building the log entry

In [24]:
log = {
    'date and time': datetime.now().strftime('%Y-%m-%d'),
    'program': 'MapReduce',
    'Execution time (s)': f"{round(execution_time, 3)}",
    'Qty of words': len(word_list),
    'Qty of repeated words': len(repeated_words),
    'Qty of non repeated words': len(non_repeated_words),
    # Reuse the sorted list computed earlier instead of re-sorting on every lookup.
    '1ª most repeated word': most_common[0][0],
    '1ª most repeated word (Freq)': most_common[0][1],
    '2ª most repeated word': most_common[1][0],
    '2ª most repeated word (Freq)': most_common[1][1],
    '3ª most repeated word': most_common[2][0],
    '3ª most repeated word (Freq)': most_common[2][1],
}

In [25]:
log_df.loc[len(log_df)] = log

In [26]:
log_df.head()

Unnamed: 0,date and time,program,Execution time (s),Qty of words,Qty of repeated words,Qty of non repeated words,1ª most repeated word,1ª most repeated word (Freq),2ª most repeated word,2ª most repeated word (Freq),3ª most repeated word,3ª most repeated word (Freq)
0,2023-11-07,MapReduce,98.581,300399,11646,12032,Excellent,6584,Persuasion,5740,Holes,4060
