
## Objective

The objective of this project is to fine-tune a language model on movie overviews fetched from the TMDB API so that it can generate new overviews. We first perform exploratory data analysis (EDA) on the fetched overviews and then fine-tune the model for text generation.

Group Members:
- Britty Bidari (C0861112)
- Jaspreet Kaur (C0861116)
- Vijay Seelam (C08573219)
- Sushant Giri (C0861112)

In [1]:
# Step 1: Install required libraries
!pip install fastai
!pip install tmdbv3api

Collecting fastai
  Obtaining dependency information for fastai from https://files.pythonhosted.org/packages/f8/81/7df1ed81c1004c6705b666652a467822b819411858b21e1c174ceaf6d464/fastai-2.7.18-py3-none-any.whl.metadata
  Downloading fastai-2.7.18-py3-none-any.whl.metadata (9.1 kB)
Collecting fastdownload<2,>=0.0.5 (from fastai)
  Obtaining dependency information for fastdownload<2,>=0.0.5 from https://files.pythonhosted.org/packages/47/60/ed35253a05a70b63e4f52df1daa39a6a464a3e22b0bd060b77f63e2e2b6a/fastdownload-0.0.7-py3-none-any.whl.metadata
  Downloading fastdownload-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Collecting fastcore<1.8,>=1.5.29 (from fastai)
  Obtaining dependency information for fastcore<1.8,>=1.5.29 from https://files.pythonhosted.org/packages/d7/3a/a0b1c764426622287c9b6547d4ea637c406bc884141814df4a5ebab3ab9b/fastcore-1.7.29-py3-none-any.whl.metadata
  Downloading fastcore-1.7.29-py3-none-any.whl.metadata (3.6 kB)
Collecting fastprogress>=0.2.4 (from fastai)
  Obtaining de

ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\Users\\BrittyBidari\\anaconda3\\Lib\\site-packages\\~umpy.libs\\libscipy_openblas64_-caad452230ae4ddb57899b8b3a33c55c.dll'
Consider using the `--user` option or check the permissions.



Collecting tmdbv3api
  Obtaining dependency information for tmdbv3api from https://files.pythonhosted.org/packages/35/fb/9d575292bb7794a7a85bcdbf6c09928aae5ca2ae9f684f7fbbd902e281c4/tmdbv3api-1.9.0-py3-none-any.whl.metadata
  Downloading tmdbv3api-1.9.0-py3-none-any.whl.metadata (8.0 kB)
Downloading tmdbv3api-1.9.0-py3-none-any.whl (25 kB)
Installing collected packages: tmdbv3api
Successfully installed tmdbv3api-1.9.0


In [2]:
!pip install wordcloud

Collecting wordcloud
  Obtaining dependency information for wordcloud from https://files.pythonhosted.org/packages/00/09/abb305dce85911b8fba382926cfc57f2f257729e25937fdcc63f3a1a67f9/wordcloud-1.9.4-cp311-cp311-win_amd64.whl.metadata
  Downloading wordcloud-1.9.4-cp311-cp311-win_amd64.whl.metadata (3.5 kB)
Downloading wordcloud-1.9.4-cp311-cp311-win_amd64.whl (299 kB)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.9.4


In [3]:
# Step 2: Import necessary modules
import numpy as np
from fastai.text.all import *

ModuleNotFoundError: No module named 'fastai'
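
The `ModuleNotFoundError` above follows from the failed `fastai` install in Step 1. The following is a minimal sketch for checking which packages are importable before running the rest of the notebook. The helper name `missing_packages` is hypothetical, and the `--user` flag is simply the option that pip's own error message suggests; it may not be the right fix for every environment.

```python
import importlib.util
import sys

def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]

required = ["fastai", "tmdbv3api", "wordcloud"]
missing = missing_packages(required)
if missing:
    # Use the interpreter that runs this notebook so pip targets the same environment
    cmd = [sys.executable, "-m", "pip", "install", "--user", *missing]
    print("Missing packages:", missing)
    print("Suggested command:", " ".join(cmd))
else:
    print("All required packages are importable.")
```

Running the suggested command and restarting the kernel should make the import in Step 2 succeed, provided the permission problem was the only cause.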

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

In [None]:
# Step 3: Setup your data
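import os
from tmdbv3api import TMDb
tmdb = TMDb()
# Read the key from an environment variable instead of hard-coding it in the notebook
# (the variable name TMDB_API_KEY is our choice; set it before starting Jupyter)
tmdb.api_key = os.environ['TMDB_API_KEY']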

In [None]:
# Fetch movie overviews from TMDB API
from tmdbv3api import Movie
movie = Movie()

# top_rated() returns one page of results; skip entries with an empty overview
movie_list = movie.top_rated()
movie_overviews = [m.overview for m in movie_list if m.overview]
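
Calling `top_rated()` without arguments fetches a single page of results, which gives a very small corpus for fine-tuning. The sketch below shows one way to collect overviews across several pages while dropping empty and duplicate entries. The names `collect_overviews` and `fetch_page` are hypothetical, and the usage line assumes that `Movie.top_rated` accepts a `page` keyword in the installed `tmdbv3api` version.

```python
def collect_overviews(fetch_page, n_pages):
    """Collect unique, non-empty overviews from pages 1..n_pages.

    `fetch_page` is any callable that takes a page number and returns an
    iterable of objects with an `overview` attribute.
    """
    seen = set()
    overviews = []
    for page in range(1, n_pages + 1):
        for m in fetch_page(page):
            # Treat a missing or None overview as an empty string
            text = (getattr(m, "overview", "") or "").strip()
            if text and text not in seen:
                seen.add(text)
                overviews.append(text)
    return overviews

# Hypothetical usage with the notebook's `movie` object:
# movie_overviews = collect_overviews(lambda p: movie.top_rated(page=p), n_pages=10)
```

Passing the fetch function in as an argument keeps the helper independent of the API client, so it can be tried out with fake data before spending API requests.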

In [None]:
# Save movie overviews to a text file
with open('/content/movie_overviews.txt', 'w', encoding='utf-8') as f:
    for overview in movie_overviews:
        f.write(overview + '\n')


In [None]:
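# Step 4: Prepare your data
# from_folder looks for text files under train/ and valid/ subfolders, which a single
# movie_overviews.txt in /content does not provide, so build the loaders from a
# DataFrame with one overview per row instead
import pandas as pd
df = pd.DataFrame({'overview': movie_overviews})
dls_lm = TextDataLoaders.from_df(df, text_col='overview', is_lm=True, valid_pct=0.1)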
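
The `valid_pct=0.1` argument asks fastai to hold out a random 10% of the items for validation. The following pure-Python sketch illustrates that idea only; it is not fastai's own code, and `random_split` is a hypothetical helper. It also shows why the number of documents matters: with very few items, the validation set can end up empty.

```python
import random

def random_split(items, valid_pct=0.1, seed=None):
    """Shuffle indices and hold out the first `valid_pct` share for validation."""
    idxs = list(range(len(items)))
    random.Random(seed).shuffle(idxs)
    cut = int(valid_pct * len(items))  # number of held-out items, rounded down
    valid = [items[i] for i in idxs[:cut]]
    train = [items[i] for i in idxs[cut:]]
    return train, valid

docs = [f"overview {i}" for i in range(20)]
train, valid = random_split(docs, valid_pct=0.1, seed=42)
print(len(train), len(valid))  # 18 2

# With a single document, nothing is left for validation
train_one, valid_one = random_split(["only one document"], valid_pct=0.1)
print(len(train_one), len(valid_one))  # 1 0
```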


In [None]:
# Basic EDA
num_movies = len(movie_overviews)
avg_length = np.mean([len(overview.split()) for overview in movie_overviews])

print(f"Total number of movie overviews: {num_movies}")
print(f"Average length of movie overviews: {avg_length:.2f} words")

# Create a WordCloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(movie_overviews))

# Plot the WordCloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('WordCloud of Movie Overviews')
plt.show()
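
A word cloud gives a visual impression but no numbers. As a complement, the sketch below computes length statistics and the most frequent words with the standard library. The helper `overview_stats` and its `stopwords` parameter are hypothetical additions, and the tokenization (lowercased runs of letters and apostrophes) is a simplifying assumption.

```python
import re
from collections import Counter
from statistics import median

def overview_stats(overviews, top_n=10, stopwords=frozenset()):
    """Summarize word counts per overview and the most common words.

    Expects a non-empty list of strings.
    """
    lengths = [len(o.split()) for o in overviews]
    words = [
        w
        for o in overviews
        for w in re.findall(r"[a-z']+", o.lower())
        if w not in stopwords
    ]
    return {
        "min_len": min(lengths),
        "median_len": median(lengths),
        "max_len": max(lengths),
        "top_words": Counter(words).most_common(top_n),
    }

# Hypothetical usage with the notebook's data:
# print(overview_stats(movie_overviews, stopwords={"the", "a", "of", "and", "to", "in"}))
```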

In [None]:
# Step 5: Fine-tune the language model
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fine_tune(4, 1e-2)
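
The `Perplexity()` metric reported during training is the exponential of the average cross-entropy loss, so lower values mean the model is less "surprised" by the validation text. A minimal sketch of that relationship, using a hypothetical helper `perplexity` that takes the probabilities the model assigned to the correct tokens:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the correct tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that spreads probability evenly over 4 choices has perplexity of about 4
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# A more confident (and correct) model scores lower
print(perplexity([0.9, 0.8, 0.95, 0.85]))
```

One way to read the number: a perplexity of k means the model is, on average, about as uncertain as if it were choosing uniformly among k tokens.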


In [None]:
# Step 6: Generate text using the trained model
prompt = "Once upon a time"
generated_text = learn.predict(prompt, n_words=100, temperature=0.7)
print(generated_text)
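
The `temperature` argument controls how adventurous the sampling is: values below 1 sharpen the next-word distribution toward the most likely words, and values above 1 flatten it. The sketch below illustrates the general technique on raw scores (logits) with NumPy. It is a generic illustration, not fastai's internal implementation, and `sample_with_temperature` is a hypothetical helper.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample an index from softmax(logits / temperature).

    Returns the sampled index and the probability vector that was used.
    """
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs

logits = [2.0, 1.0, 0.0]
for t in (0.5, 0.7, 1.0, 2.0):
    _, probs = sample_with_temperature(logits, temperature=t)
    print(t, np.round(probs, 3))
```

As the temperature drops, the printed probabilities concentrate on the first entry, which is why low temperatures tend to give safer but more repetitive text.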

In [None]:
# Step 7: EDA and Visualization
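# Generate several samples so the histogram has more than one data point
# (a single generated string would yield only one length)
samples = [learn.predict(prompt, n_words=100, temperature=0.7) for _ in range(10)]
generated_lengths = [len(text.split()) for text in samples]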

# Plot the distribution of generated text lengths
plt.hist(generated_lengths, bins=20, edgecolor='black')
plt.xlabel('Generated Text Length')
plt.ylabel('Frequency')
plt.title('Distribution of Generated Text Lengths')
plt.show()

In [None]:
# Create a word cloud from the generated text
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(generated_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Generated Text')
plt.show()

## Conclusion

In this project, we fetched movie overviews from the TMDB API, performed exploratory data analysis (EDA), and fine-tuned a language model to generate new movie overviews. The word cloud highlights the most frequent words in the overviews. The project illustrates how a pretrained language model can be adapted to generate text in the style of an existing corpus.