This notebook converts the review data into multiple rows, one row per review.

In [4]:
import pandas as pd
import ast

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
df = pd.read_csv("/content/drive/MyDrive/IMDB Project/Cleaning1/cleaned_data/reviews_and_rating_and_ids.csv")

In [7]:
df.shape

(69419, 4)

In [8]:
df.head(1)

,Unnamed: 0,imdb_id,title,reviews_data
0,0,tt0114126,Thunderbolt,"[""{'review': 'Jackie Loh Chan is a motor mecha..."


### Copying the dataframe for editing

In [9]:
df_edit = df.copy()

Checking Columns

In [10]:
df_edit.columns

Index(['Unnamed: 0', 'imdb_id', 'title', 'reviews_data'], dtype='object')

Checking the head

In [11]:
df_edit.head()

,Unnamed: 0,imdb_id,title,reviews_data
0,0,tt0114126,Thunderbolt,"[""{'review': 'Jackie Loh Chan is a motor mecha..."
1,1,tt5161762,Another Woman,['{\'review\': \'As I sat down to watch the 20...
2,2,tt4154796,Avengers: Endgame,"['{\'review\': ""But its a pretty good film. A ..."
3,3,tt1630029,Avatar: The Way of Water,"['{\'review\': ""Technically gorgeous, but the ..."
4,4,tt5433140,Fast X,"[""{'review': 'First, the car actions are great..."


Dropping the Unnamed: 0 column

In [13]:
df_edit = df_edit.drop(columns=['Unnamed: 0'])
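An `Unnamed: 0` column usually appears when a CSV was saved with its index and then read back without `index_col`. Here is a minimal sketch of that behaviour, assuming this is what happened with this file; the CSV text below is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row CSV that mimics a file written with the index included.
csv_text = ",imdb_id,title\n0,tt0000001,Example A\n1,tt0000002,Example B\n"

# Default read: the unnamed first column shows up as 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 uses that column as the index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Either approach (dropping the column afterwards or passing `index_col=0` when reading) should leave the same three data columns.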


## **Part 1:** Defining a function to convert the reviews data from a string to a list of dictionaries, tested on the first movie

Step 1: Extract the First Row of 'reviews_data' Column

In [14]:
first_row_reviews = df_edit['reviews_data'].iloc[0]


Step 2: Convert the String to a List of Dictionary Strings

In [15]:
try:
    reviews_list = ast.literal_eval(first_row_reviews)
except Exception as e:
    print(f"Error: {e}")
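Judging by the head output, each 'reviews_data' cell looks doubly encoded: a string holding a list whose elements are themselves strings containing dictionary literals. The sketch below shows the two parsing levels on a made-up sample; the sample values are assumptions, not rows from the dataset.

```python
import ast

# Made-up sample imitating the nesting seen in the head output:
# a string -> list of strings -> each string is a dict literal.
sample = str([
    str({'review': 'Good film', 'reviewer': 'user_a', 'rating': 8}),
    str({'review': 'Too long', 'reviewer': 'user_b'}),
])

outer = ast.literal_eval(sample)               # first pass: list of strings
inner = [ast.literal_eval(s) for s in outer]   # second pass: list of dicts

print(type(outer[0]).__name__)
print(inner[1])
```

This is why a single `ast.literal_eval` call is not enough and a second conversion is needed later on.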


Step 3: Create a DataFrame from the List of Dictionaries

Next, let's create a DataFrame from this list using the pd.DataFrame constructor. Each list element becomes one row in a single column named 0.

In [16]:
try:
    reviews_df = pd.DataFrame(reviews_list)
except Exception as e:
    print(f"Error: {e}")

Step 4: View the DataFrame

In [17]:
reviews_df.head(5)

,0
0,{'review': 'Jackie Loh Chan is a motor mechani...
1,"{'review': ""One of the most important things i..."
2,"{'review': ""I read somewher that Jackie was st..."
3,{'review': 'This moving picture deals with Cha...
4,{'review': 'This is another action-packed movi...


Step 5: Convert Each String to a Dictionary

In [18]:
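def convert_row_to_dict(row):
    # Despite the name, `row` is a single cell: a string holding a dict literal.
    try:
        return ast.literal_eval(row)
    except Exception as e:
        # Report the problem and fall back to an empty dict so processing continues.
        print(f"Error: {e}")
        return {}

# Apply the conversion to every cell of the single-column DataFrame.
reviews_df = reviews_df.applymap(convert_row_to_dict)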
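Note that `DataFrame.applymap` is deprecated from pandas 2.1 onwards in favour of `DataFrame.map`. Since there is only one column here, applying the conversion to that column with `Series.apply` should behave the same on older and newer pandas versions. A minimal sketch with a made-up frame; `parse_literal` is a hypothetical helper, not part of the notebook.

```python
import ast
import pandas as pd

def parse_literal(text):
    """Hypothetical helper: parse one cell, returning {} when parsing fails."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return {}

# Made-up single-column frame shaped like reviews_df before Step 5.
toy_df = pd.DataFrame(["{'review': 'Fine', 'rating': 7}", "not a literal ("])

# Column-wise apply avoids applymap and works on old and new pandas alike.
toy_df[0] = toy_df[0].apply(parse_literal)
print(toy_df[0].tolist())
```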


Step 6: Split the Dictionary into Separate Columns

In [19]:
reviews_df = pd.json_normalize(reviews_df[0])


In [20]:
reviews_df.head(5)

,review,reviewer,rating
0,Jackie Loh Chan is a motor mechanic whose fath...,bob the moo,
1,One of the most important things in a Jackie C...,sagacity_,
2,I read somewher that Jackie was still recoveri...,rutt13-1,8.0
3,This moving picture deals with Chan Foh To (Ja...,ma-cortes,6.0
4,This is another action-packed movie starring J...,OllieSuave-007,6.0
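The empty rating cells are worth a quick look. One possible reason, assumed here rather than checked against the data, is that some review dictionaries have no 'rating' key at all; `pd.json_normalize` fills such gaps with NaN. A small sketch with made-up records:

```python
import pandas as pd

# Made-up records; the second one has no 'rating' key.
records = [
    {'review': 'Fine', 'reviewer': 'user_a', 'rating': 7},
    {'review': 'No score given', 'reviewer': 'user_b'},
]

flat = pd.json_normalize(records)

print(flat.columns.tolist())
print(flat['rating'].isna().tolist())
```

Because of the NaN, the rating column ends up as a float column, which would explain values displayed as 8.0 and 6.0.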


Step 8: Attach 'imdb_id' and 'title' to Each Row

We can add the 'imdb_id' and 'title' from the original DataFrame to each row in the reviews DataFrame. Since we're working with the first row of the original DataFrame, we can directly access these values:

In [21]:
imdb_id = df_edit['imdb_id'].iloc[0]
title = df_edit['title'].iloc[0]


Now, let's add these values as new columns in the reviews DataFrame:

In [22]:
reviews_df['imdb_id'] = imdb_id
reviews_df['title'] = title

In [23]:
reviews_df

,review,reviewer,rating,imdb_id,title
0,Jackie Loh Chan is a motor mechanic whose fath...,bob the moo,,tt0114126,Thunderbolt
1,One of the most important things in a Jackie C...,sagacity_,,tt0114126,Thunderbolt
2,I read somewher that Jackie was still recoveri...,rutt13-1,8.0,tt0114126,Thunderbolt
3,This moving picture deals with Chan Foh To (Ja...,ma-cortes,6.0,tt0114126,Thunderbolt
4,This is another action-packed movie starring J...,OllieSuave-007,6.0,tt0114126,Thunderbolt
5,"I'm a die hard Jackie Chan fan, but ""Thunderbo...",Monkey Bastard,,tt0114126,Thunderbolt
6,"""Thunderbolt"" is probably Jackie Chan's worst ...",gridoon,4.0,tt0114126,Thunderbolt
7,An atypical Jackie Chan production in that it ...,Leofwine_draca,4.0,tt0114126,Thunderbolt
8,"First off, I found the plot a bit problematic ...",imdb-21622,6.0,tt0114126,Thunderbolt
9,At this point Hong Kong might be getting too s...,ebiros2,5.0,tt0114126,Thunderbolt


It works!!!

## **Part 2:** Recreating it with 100 movies

Step 10: Define a Function to Process a Row

In [24]:
def process_row(row):
    try:
        # Convert the 'reviews_data' string to a list of dict-literal strings
        reviews_list = ast.literal_eval(row['reviews_data'])

        # Create a single-column DataFrame from that list
        reviews_df = pd.DataFrame(reviews_list)

        # Convert each string cell to a dictionary
        reviews_df = reviews_df.applymap(convert_row_to_dict)

        # Split the dictionary into separate columns
        reviews_df = pd.json_normalize(reviews_df[0])

        # Attach 'imdb_id' and 'title' to each row
        reviews_df['imdb_id'] = row['imdb_id']
        reviews_df['title'] = row['title']

        return reviews_df
    except Exception as e:
        print(f"Error: {e}")
        return pd.DataFrame()
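As a possible alternative to building one small DataFrame per movie, the same reshaping can be sketched with `DataFrame.explode`. This is a different technique from the one used in this notebook, shown on made-up data with the same column layout as df_edit, and it has no error handling, so a single malformed cell would stop it.

```python
import ast
import pandas as pd

# Made-up two-movie frame with the same column layout as df_edit.
toy = pd.DataFrame({
    'imdb_id': ['tt0000001', 'tt0000002'],
    'title': ['Example A', 'Example B'],
    'reviews_data': [
        str([str({'review': 'Good', 'reviewer': 'u1', 'rating': 8}),
             str({'review': 'Meh', 'reviewer': 'u2'})]),
        str([str({'review': 'Great', 'reviewer': 'u3', 'rating': 10})]),
    ],
})

# Outer parse: string -> list of strings; explode gives one row per review.
long_df = toy.assign(reviews_data=toy['reviews_data'].apply(ast.literal_eval))
long_df = long_df.explode('reviews_data', ignore_index=True)

# Inner parse: each string -> dict, then spread the dict keys into columns.
parsed = long_df['reviews_data'].apply(ast.literal_eval).tolist()
result = pd.concat(
    [pd.json_normalize(parsed), long_df[['imdb_id', 'title']]], axis=1
)
print(result)
```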

Step 11: Apply the Function to Each of the First 100 Rows

Now, let's apply this function to each of the first 100 rows of df_edit. We'll use the DataFrame's apply method with axis=1 to do this, and then concatenate the resulting DataFrames with pd.concat:

In [25]:
reviews_dfs = df_edit.head(100).apply(process_row, axis=1)
all_reviews_df = pd.concat(reviews_dfs.values)

In [28]:
all_reviews_df.head(200)

,review,reviewer,rating,imdb_id,title
0,Jackie Loh Chan is a motor mechanic whose fath...,bob the moo,,tt0114126,Thunderbolt
1,One of the most important things in a Jackie C...,sagacity_,,tt0114126,Thunderbolt
2,I read somewher that Jackie was still recoveri...,rutt13-1,8,tt0114126,Thunderbolt
3,This moving picture deals with Chan Foh To (Ja...,ma-cortes,6,tt0114126,Thunderbolt
4,This is another action-packed movie starring J...,OllieSuave-007,6,tt0114126,Thunderbolt
...,...,...,...,...,...
160,"A great ending, to a much loved saga. It would...",alan-68691,9,tt4154796,Avengers: Endgame
161,It was a marvels fans dream. Loved it all the ...,Prabhuraj,9,tt4154796,Avengers: Endgame
162,Loved Endgame but with a lukewarm feeling for ...,craigearl,8,tt4154796,Avengers: Endgame
163,This movie was awesome. It showed how much we ...,cotandreea,10,tt4154796,Avengers: Endgame


#### Okay! Now to export this

In [29]:
all_reviews_df.to_csv("/content/drive/MyDrive/IMDB Project/review_analysis/data/100_movies_reviews.csv", index=False)


## **Part 3:** Doing this with ALL the movies

In [30]:
# Define the number of rows to process at a time
chunk_size = 20

In [31]:
# Calculate the number of chunks (ceiling division, so no empty trailing chunk)
num_chunks = (len(df_edit) + chunk_size - 1) // chunk_size
num_chunks

3471
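A quick check of the chunk arithmetic: ceiling division gives 3471 chunks for 69419 rows, and it avoids an extra empty chunk when the row count is an exact multiple of the chunk size, which a floor-division-plus-one formula would add. `chunk_count` below is a hypothetical helper used only for this check.

```python
def chunk_count(n_rows: int, chunk_size: int) -> int:
    """Hypothetical helper: number of chunks needed to cover n_rows."""
    return (n_rows + chunk_size - 1) // chunk_size

print(chunk_count(69419, 20))  # row count and chunk size from this notebook
print(chunk_count(40, 20))     # exact multiple: no trailing empty chunk
print(40 // 20 + 1)            # floor division plus one overcounts here
```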

In [32]:
# Define the file path
file_path = "/content/drive/MyDrive/IMDB Project/review_analysis/data/all_reviews.csv"

In [33]:
# Track whether the header has been written, so that a failed first chunk
# does not leave the file without one
header_written = False

for i in range(num_chunks):
    try:
        # Extract the current chunk
        chunk = df_edit.iloc[i*chunk_size : (i+1)*chunk_size]
    except Exception as e:
        print(f"Error extracting chunk {i}: {e}")
        continue

    try:
        # Process the chunk and concatenate the results
        reviews_dfs = chunk.apply(process_row, axis=1)
        all_reviews_df = pd.concat(reviews_dfs.values)
    except Exception as e:
        print(f"Error processing chunk {i}: {e}")
        continue

    try:
        # Save the results to a CSV file
        if not header_written:
            # Overwrite the file and write the header with the first successful chunk
            all_reviews_df.to_csv(file_path, mode='w', index=False)
            header_written = True
        else:
            # Append the data for subsequent chunks without the header
            all_reviews_df.to_csv(file_path, mode='a', header=False, index=False)
    except Exception as e:
        print(f"Error saving chunk {i}: {e}")
        continue

    # Print the progress
    print(f"Saved reviews for {min((i+1)*chunk_size, len(df_edit))} out of {len(df_edit)} titles.")

Streaming output truncated to the last 5000 lines.
Error: malformed node or string on line 1: <ast.Name object at 0x7c5b82874f10>
Error: malformed node or string on line 1: <ast.Name object at 0x7c5b82874e20>
Error: malformed node or string on line 1: <ast.Name object at 0x7c5b828744f0>
Error: malformed node or string on line 1: <ast.Name object at 0x7c5b828747f0>
Error: malformed node or string on line 1: <ast.Name object at 0x7c5b82874c70>
Error: malformed node or string on line 1: <ast.BoolOp object at 0x7c5b82875900>
Error: malformed node or string on line 1: <ast.Name object at 0x7c5b7dc13730>
Error: 0
Error: malformed node or string on line 1: <ast.Name object at 0x7c5b7dc13f40>
Error: malformed node or string on line 1: <ast.Name object at 0x7c5b7dc112d0>
Error: malformed node or string on line 1: <ast.Name object at 0x7c5bc08d2770>
Error: malformed node or string on line 1: <ast.Name object at 0x7c5b83218d00>
Error: malformed node or string on line 1: <ast.Name ob
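The inputs that triggered these messages are not shown, so the following is only a guess at two likely causes. `ast.literal_eval` rejects anything that is not a plain literal, for example a bare name such as `nan`, which would match the `<ast.Name object ...>` messages. An empty review list gives a DataFrame with no column 0, and the resulting `KeyError` prints as `Error: 0`. Both cases are reproduced below on made-up inputs.

```python
import ast
import pandas as pd

# A bare name such as nan is not a Python literal, so parsing is refused.
try:
    ast.literal_eval("{'rating': nan}")
    bare_name_error = None
except ValueError as exc:
    bare_name_error = exc

# An empty review list produces a frame with no column 0.
try:
    pd.DataFrame([])[0]
    empty_list_error = None
except KeyError as exc:
    empty_list_error = exc

print(type(bare_name_error).__name__, bare_name_error)
print(f"Error: {empty_list_error}")
```

Counting how many rows fail, and inspecting a few of the offending strings, would confirm whether these are the actual causes before deciding how to handle them.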

# Appendix Section -----------------------