## Задание 1   Открой файл [distances.ipynb](src/notebooks/distances.ipynb).
* Объедини общие данные о фильмах [tmdb_5000_movies](https://files.sberdisk.ru/s/te4QbzdxKgsFQXA) и каст фильмов
[tmdb_5000_credits](https://files.sberdisk.ru/s/H9oRuXQt5mFz3T9).
* Оставь в датасете только фильмы, которые вышли в "релиз".
* Убери фильмы с пропусками в колонках ['overview', 'genres', 'keywords']
* Выведи количество фильмов, оставшихся в выборке

In [59]:
import pandas as pd


movies = pd.read_csv("/content/tmdb_5000_movies.csv")
credits = pd.read_csv("/content/tmdb_5000_credits.csv")


mergeDf = pd.merge(movies, credits, left_on='id', right_on='movie_id')
mergeDf.rename(columns={'title_x': 'title'}, inplace=True)
mergeDf.drop('title_y', axis=1, inplace=True)


filterDf = mergeDf[mergeDf['status'] == 'Released']

filterDf = filterDf.dropna(subset=['overview', 'genres', 'keywords'])


print("Количество фильмов:", filterDf.shape)

Количество фильмов: (4792, 23)


## Задание 2    Реализуем алгоритм рекомендации на основе описания фильма (`overview`) и ключевых слов к фильму (`keywords`).
Объедини тексты этих колонок и проведи предобработку:
* Замени NaN в описании фильма на пустой символ `''`
* Удали все английские стоп-слова (используй параметр `stop_words` в `TfidfVectorizer`)
* Рассчитай матрицу [Tf-Idf](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) для описания фильмов.

Выведи размер получившейся матрицы
> Параметр `max_features` в `TfidfVectorizer` должен быть равен 10000

In [60]:
!pip install scikit-learn



In [61]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np

In [62]:
dfCopy2 = filterDf.copy()

dfCopy2['overview'].fillna('', inplace=True)

text_data = dfCopy2['overview'] + ' ' + dfCopy2['keywords']

vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)

tfidf_matrix = vectorizer.fit_transform(text_data)


print("Размер матрицы Tf-Idf:", tfidf_matrix.shape)

Размер матрицы Tf-Idf: (4792, 10000)


## Задание 3  Рассчитай косинусное расстояние между фильмами. Составь из этой матрицы `pd.DataFrame`. Для дальнейшего удобства,
колонки и индексы таблицы назови согласно`id` фильма. \
Сохрани получившийся `DataFrame` c расстояниями в папку [assets](src/assets) с названием `distance.csv`.
А сам объединенный датасет с фильмами сохрани в папку [assets](src/assets) с названием `movies.csv`.

> Получившиеся файлы `distance.csv` и `movies.csv` пушить в GitLab не нужно!


In [63]:
from sklearn.metrics.pairwise import cosine_similarity


cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_sim

array([[1.        , 0.16160013, 0.08417271, ..., 0.06029947, 0.01025163,
        0.04057541],
       [0.16160013, 1.        , 0.0795629 , ..., 0.06733997, 0.        ,
        0.03835326],
       [0.08417271, 0.0795629 , 1.        , ..., 0.03939055, 0.        ,
        0.02046046],
       ...,
       [0.06029947, 0.06733997, 0.03939055, ..., 1.        , 0.01632488,
        0.03620626],
       [0.01025163, 0.        , 0.        , ..., 0.01632488, 1.        ,
        0.01202703],
       [0.04057541, 0.03835326, 0.02046046, ..., 0.03620626, 0.01202703,
        1.        ]])

In [64]:
cosine_sim.shape

(4792, 4792)

In [65]:
distance_df = pd.DataFrame(cosine_sim, index=dfCopy2['id'], columns=dfCopy2['id'])
distance_df.head(2)


id,19995,285,206647,49026,49529,559,38757,99861,767,209112,...,182291,286939,124606,14337,67238,9367,72766,231617,126186,25975
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
19995,1.0,0.1616,0.084173,0.141075,0.290498,0.163335,0.109851,0.135357,0.096274,0.092502,...,0.127258,0.0,0.062076,0.114813,0.0,0.043521,0.0,0.060299,0.010252,0.040575
285,0.1616,1.0,0.079563,0.134989,0.148862,0.155402,0.10143,0.095819,0.091001,0.078926,...,0.120289,0.0,0.056188,0.085687,0.0,0.048129,0.0,0.06734,0.0,0.038353


In [113]:
mergeDf.to_csv('/content/gdrive/MyDrive/assets/movies.csv', index=False)

In [114]:
distance_df.to_csv('/content/gdrive/MyDrive/assets/distance.csv')





Задача 4
Мы прибыли и данные о фильмах, теперь перейдем к самой реализации сервиса. Его основа находится в папке src . Для начала необходимо настроить переменные окружения
для проекта. В файле .envукажи путь к файлу distance.csvи movies.csv.

In [66]:
import colab_env

In [67]:
!pip install colab-env --upgrade



In [68]:
colab_env.__version__

'0.2.0'

In [69]:
!more gdrive/My\ Drive/vars.env

COLAB_ENV = Active
TEST = DISTANCE
MOVIES = DISTANCE


In [70]:
colab_env.envvar_handler.add_env("TEST", "DISTANCE", overwrite=True)

!more gdrive/My\ Drive/vars.env

COLAB_ENV = Active
TEST = DISTANCE
MOVIES = DISTANCE


In [71]:
!pip install colab-env -qU
from colab_env import envvar_handler

In [72]:
import os

os.getenv("MOVIES")

'DISTANCE'

In [73]:
colab_env.RELOAD()

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Задача 5
Перейдем к самой реализации. Допиши метод рекомендации , чтобы он возвращал самые близкие фильмы. Попробуй протестировать работу на серию фильмов (например, "Темный рыцарь" или "Пираты Карибского моря").

In [74]:
!pip install streamlit



In [75]:
!pip install pyngrok



In [76]:
%%writefile app.py


Overwriting app.py


In [77]:
import os
import streamlit as st
from dotenv import load_dotenv
from pathlib import Path
from dotenv import load_dotenv
from typing import List, Set, Optional, Any
from typing import List, Set, Optional, Any

In [78]:
from numpy.core.fromnumeric import searchsorted














In [80]:
import streamlit as st
from PIL import Image

In [98]:
%%writefile app.py
import streamlit as st
import streamlit.components.v1 as stc

# EDA Pkgs
import pandas as pd

HTML_BANNER = """
    <div style="background-color:#464e5f;padding:10px;border-radius:10px">
    <h1 style="color:white;text-align:center;">Movie Directory App </h1>
    </div>
    """


def main():
	"""Basics on st.beta columns/layout"""

	menu = ["Home","Search","About"]
	choice = st.sidebar.selectbox("Menu",menu)
	stc.html(HTML_BANNER)

	df = pd.read_csv("final_movielens_500_db.csv")
	# Change Year to Datetime
	df['year'] = pd.to_datetime(df['year'])


	if choice == 'Home':
		st.subheader("Home")


		# with st.beta_expander("Title"):
		# 	mytext = st.text_area("Type Here")
		# 	st.write(mytext)
		# 	st.success("Hello")


		# st.dataframe(df)
		movies_title_list = df['title'].tolist()

		movie_choice = st.selectbox("Movie Title",movies_title_list)
		with st.expander('Movies DF',expanded=False):
			st.dataframe(df.head(5))

			# Filter
			img_link = df[df['title'] == movie_choice]['img_link'].values[0]
			title = df[df['title']== movie_choice]['title'].values
			genre = df[df['title']== movie_choice]['genres'].values


		# Layout
			# st.write(img_link)
			# st.image(img_link)
		c1,c2,c3 = st.columns([1,2,1])

		with c1:
			with st.expander("Title"):
				st.success(title)


		with c2:
			with st.expander("Image"):
				st.image(img_link,use_column_with=True)


		with c3:
			with st.expander("Genre"):
				st.write(genre)





	elif choice == "Search":
		st.subheader("Search Movies")

		with st.expander("Search By Year"):
			movie_year = st.number_input("Year",1995,2020)

			df_for_year = df[df['year'].dt.year == movie_year]
			st.dataframe(df_for_year)

		col1,col2,col3  = st.columns([1,2,1])

		with col1:
			with st.expander("Title"):
				for t in df_for_year['title'].tolist():
					st.success(t)


		with col2:
			with st.expander("Images"):
				for i in df_for_year['img_link'].tolist():
					st.image(i,use_column_with=True)


		with col3:
			with st.expander("Genre"):
				for g in df_for_year['genres'].tolist():
					st.write(g)







	else:
		st.subheader("About")
		st.text("Built with Streamlit")
		st.text("Елена")



if __name__ == '__main__':
	main()

Overwriting app.py


In [108]:
def recommend(movie):


    movie_index = dfCopy2[dfCopy2['title']==movie].index[0]
    distances = similarity[movie_index]
    movies_list = sorted(list(enumerate(distances)),reverse=True,key=lambda x:x[1])[1:6]



    for i in movies_list:
        print(dfCopy2.iloc[i[0]].title)

In [None]:
#recommend('Avatar')

In [99]:
!ls

app.py			      ngrok-stable-linux-amd64.zip.1
app.ry			      ngrok-stable-linux-amd64.zip.2
drive			      sample_data
final_movielens_500_db.csv    st_movie_app
gdrive			      tmdb_5000_credits.csv
ngrok			      tmdb_5000_movies.csv
ngrok-stable-linux-amd64.zip


In [100]:
!ngrok authtoken 2RyafJFt52bPMXmpFIl8KukyfMQ_adDLh46EsfDem8BVYbFJ

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml


In [101]:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip

--2023-07-03 14:48:47--  https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 52.202.168.65, 18.205.222.128, 54.237.133.81, ...
Connecting to bin.equinox.io (bin.equinox.io)|52.202.168.65|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13921656 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip.3’


2023-07-03 14:48:48 (18.1 MB/s) - ‘ngrok-stable-linux-amd64.zip.3’ saved [13921656/13921656]



In [88]:
!unzip ngrok-stable-linux-amd64.zip

Archive:  ngrok-stable-linux-amd64.zip
replace ngrok? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [102]:
get_ipython().system_raw('./ngrok http 8501 &')

In [103]:
from sys import stdin
! curl -s http://localhost:4040/api/tunnels | python3 -c \
"import sys, json; print (json.load(sys.stdin)['tunnels'][0]['public_url'])"

https://d0cd-34-125-20-182.ngrok-free.app


In [104]:
!streamlit run /content/app.py


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to False.
[0m
[0m
[34m[1m  You can now view your Streamlit app in your browser.[0m
[0m
[34m  Network URL: [0m[1mhttp://172.28.0.12:8501[0m
[34m  External URL: [0m[1mhttp://34.125.20.182:8501[0m
[0m
2023-07-03 14:49:24.021 Uncaught app exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
  File "/content/app.py", line 114, in <module>
    main()
  File "/content/app.py", line 61, in main
    st.image(img_link,use_column_with=True)
  File "/usr/local/lib/python3.10/dist-packages/streamlit/runtime/metrics_util.py", line 356, in wrapped_func
    result = non_optional_func(*args, **kwargs)
TypeError: ImageMixin.image() got an unexpected keyword argument 'use_column_with'
2023-07-03 14:49:36.818 Uncaught app exception
Traceback (most recent call last):
  File 