# Pokemon wiki recommendation system
The project recommends pokemon articles from https://bulbapedia.bulbagarden.net/wiki/Main_Page based on pokemons provided by the user (ones they have read about, are interested in, etc.).

Project files can be found on GitHub - https://github.com/bujowskis/pokemon-wiki-recommender.

# Scraping

We use Scrapy to obtain the data from Bulbapedia.

Scrapy project files can be found on GitHub - https://github.com/bujowskis/pokemon-wiki-recommender. If you wish to run the scraper in Google Colab, you can upload them to the runtime and run the commands below.

If you don't wish to run the scraper yourself, you can skip to [Loading scraped and preprocessed data](#loading_preprocessed).

The general idea for scraping the content is that two types of pages hold the pokemon data we're interested in:
- list of content page
  - contains links to articles on individual pokemons
  - may (or may not) contain a link to the next list of content page - a single list of content page displays a limited number of pokemons (200 pokemons/page at the time of writing), so the rest is rendered on the next page
- individual article page - contains info on the individual pokemon

So, to scrape all the pokemon data, the crawler works as follows:
1. We start from the following page, which lists the first N pokemons - https://bulbapedia.bulbagarden.net/wiki/Category:Pok%C3%A9mon
2. We extract the links to individual articles and parse their content (select all paragraphs, extract, and join their text)
3. We check if there's a next list page to go to
   - If so, we parse the next list page in the same way as the first one (step 2)
   - If not, the crawler has covered all list pages and articles - crawling is complete
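
The steps above can be sketched without Scrapy as a plain loop. This is only an illustration of the control flow, not the spider from the repository: `crawl`, `LIST_PAGES`, `PARAGRAPHS`, and the fake URLs are hypothetical names, and the "site" is an in-memory dictionary so the example runs offline.

```python
from typing import Callable, Dict, Optional


def crawl(start_url: str,
          fetch_list_page: Callable[[str], dict],
          fetch_article: Callable[[str], str]) -> Dict[str, str]:
    """Follow list pages until there is no next page, collecting article text."""
    articles: Dict[str, str] = {}
    url: Optional[str] = start_url
    while url is not None:
        page = fetch_list_page(url)               # steps 1 and 3: visit a list page
        for article_url in page["articles"]:      # step 2: visit each linked article
            articles[article_url] = fetch_article(article_url)
        url = page.get("next")                    # step 3: None means crawling is complete
    return articles


# Hypothetical in-memory stand-in for the wiki (assumption, not real Bulbapedia data)
LIST_PAGES = {
    "list/1": {"articles": ["wiki/A", "wiki/B"], "next": "list/2"},
    "list/2": {"articles": ["wiki/C"], "next": None},
}
PARAGRAPHS = {
    "wiki/A": ["First paragraph.", "Second paragraph."],
    "wiki/B": ["Only paragraph."],
    "wiki/C": ["Another paragraph."],
}


def fetch_article(url: str) -> str:
    # Mirrors "select all paragraphs, extract, and join their text"
    return " ".join(PARAGRAPHS[url])


articles = crawl("list/1", lambda url: LIST_PAGES[url], fetch_article)
print(list(articles))
```

In the real spider, the two fetch functions correspond to Scrapy callbacks that parse the HTML of each response.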

In [1]:
!pip install scrapy

Collecting scrapy
  Downloading Scrapy-2.11.0-py2.py3-none-any.whl (286 kB)
Collecting Twisted<23.8.0,>=18.9.0 (from scrapy)
  Downloading Twisted-22.10.0-py3-none-any.whl (3.1 MB)
Collecting cssselect>=0.9.1 (from scrapy)
  Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Collecting itemloaders>=1.0.1 (from scrapy)
  Downloading itemloaders-1.1.0-py3-none-any.whl (11 kB)
Collecting parsel>=1.5.0 (from scrapy)
  Downloading parsel-1.8.1-py2.py3-none-any.whl (17 kB)
Collecting queuelib>=1.4.2 (from scrapy)
  Downloading queuelib-1.6.2-py2.py3-none-any.whl (13 kB)
Collecting service-identity>=18.1.0 (from scrapy)
  Downloading service_identity-23.1.0-py3-none-any.whl (12 kB)
Collecting w3lib>=1.17.0 (from scrapy)
  Downloading w3lib-2.1.

In [2]:
!scrapy startproject ir_project1_wiki_recommender

New Scrapy project 'ir_project1_wiki_recommender', using template directory '/usr/local/lib/python3.10/dist-packages/scrapy/templates/project', created in:
    /content/ir_project1_wiki_recommender

You can start your first spider with:
    cd ir_project1_wiki_recommender
    scrapy genspider example example.com


In [3]:
%cd /content/ir_project1_wiki_recommender

/content/ir_project1_wiki_recommender


In [4]:
!rm -f output.json
!scrapy crawl pokemon_spider -o output.json

Streaming output truncated to the last 5000 lines.
         " (For specifics on this Pokémon's evolution in the games, refer to "
         'Game data→Evolution data.)\n'
         ' \n'
         ' \n'
         ' \n'
         ' \n'
         ' \n'
         ' Baxcalibur may draw inspiration from kaiju. Its upright posture may '
         'allude to how most kaiju are humanoid in shape, due to the need to '
         'fit a human actor inside the creature suit in kaiju films. It also '
         'has traits of certain species of dinosaurs, most notably '
         'Concavenator, a theropod dinosaur which had a crest on its back and '
         'quills on its forelimbs. Fossils of the genus were first discovered '
         'in Spain, and the crest is hypothesized to have functioned as a tool '
         'of thermoregulation.\n'
         ' Baxcalibur may also be based on other sail-backed prehistoric '
         'animals, most notably Spinosaurus, Ouranosaurus, Arizonasaurus, and '
   

In [5]:
import pandas as pd

df = pd.read_json('output.json')
print(len(df))  # fits the "at least 1,000 articles" threshold
display(df)

1021


Unnamed: 0,url,name,text
0,https://bulbapedia.bulbagarden.net/wiki/Abomas...,Abomasnow,Abomasnow (Japanese: ユキノオー Yukinooh) is a dual...
1,https://bulbapedia.bulbagarden.net/wiki/Arctov...,Arctovish,Arctovish (Japanese: ウオチルドン Uochilldon) is a d...
2,https://bulbapedia.bulbagarden.net/wiki/Arctoz...,Arctozolt,Arctozolt (Japanese: パッチルドン Patchilldon) is a ...
3,https://bulbapedia.bulbagarden.net/wiki/Arroku...,Arrokuda,Arrokuda (Japanese: サシカマス Sasikamasu) is a Wat...
4,https://bulbapedia.bulbagarden.net/wiki/Ariado...,Ariados,Ariados (Japanese: アリアドス Ariados) is a dual-ty...
...,...,...,...
1016,https://bulbapedia.bulbagarden.net/wiki/Aegisl...,Aegislash,Aegislash (Japanese: ギルガルド Gillgard) is a dual...
1017,https://bulbapedia.bulbagarden.net/wiki/Aggron...,Aggron,Aggron (Japanese: ボスゴドラ Bossgodora) is a dual-...
1018,https://bulbapedia.bulbagarden.net/wiki/Aeroda...,Aerodactyl,Aerodactyl (Japanese: プテラ Ptera) is a dual-typ...
1019,https://bulbapedia.bulbagarden.net/wiki/Abra_(...,Abra,Abra (Japanese: ケーシィ Casey) is a Psychic-type ...


In [6]:
df.to_csv("pokemons_scrapped.csv")

# Preprocessing

Since we use sklearn's `TfidfVectorizer`, which already handles much of the preprocessing (stop-word removal, tokenization, etc.), we preprocess the data by simply filtering the scraped text so that it contains only ASCII alphanumeric characters and whitespace.

We tried applying stemming and lemmatization, but the articles contain a substantial amount of pokemon-specific vocabulary, and these methods distorted those words. That is why we decided to keep things simple and apply only the alphanumeric filter.

**NOTE** - we can relatively safely remove the Japanese characters, since the romanized name always appears right next to them. The filter also strips punctuation and accented letters, so "dual-type" becomes "dualtype" and "Pokémon" becomes "Pokmon", as the output of the preprocessing cell shows.
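
To make the effect of the filter concrete, here is a minimal, self-contained sketch applied to the opening of the Abra article. The helper name `keep_alphanumeric` is ours for illustration; it uses the same regular expression as the preprocessing cell.

```python
import re


def keep_alphanumeric(text: str) -> str:
    """Drop every character that is not an ASCII letter, digit, or whitespace."""
    return re.sub(r'[^A-Za-z0-9\s]', '', text)


sample = "Abra (Japanese: ケーシィ Casey) is a Psychic-type Pokémon"
cleaned = keep_alphanumeric(sample)

# Split on whitespace so leftover double spaces do not clutter the result
print(cleaned.split())
# ['Abra', 'Japanese', 'Casey', 'is', 'a', 'Psychictype', 'Pokmon']
```

Hyphenated and slash-separated terms are merged into single tokens, which is acceptable here because it happens consistently across all articles.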



In [13]:
import re

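def preprocess_text(text):
  """Return a one-field record with the text reduced to ASCII letters, digits, and whitespace."""
  # Punctuation, Japanese characters, and accented letters are all dropped
  text_alphanumeric = re.sub(r'[^A-Za-z0-9\s]', '', text)
  return {'alphanumeric': text_alphanumeric}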

In [14]:
preprocessed_df = pd.concat([df, pd.DataFrame(list(df['text'].apply(preprocess_text)))], axis=1)
display(preprocessed_df)

Unnamed: 0,url,name,text,alphanumeric
0,https://bulbapedia.bulbagarden.net/wiki/Abomas...,Abomasnow,Abomasnow (Japanese: ユキノオー Yukinooh) is a dual...,Abomasnow Japanese Yukinooh is a dualtype Gra...
1,https://bulbapedia.bulbagarden.net/wiki/Arctov...,Arctovish,Arctovish (Japanese: ウオチルドン Uochilldon) is a d...,Arctovish Japanese Uochilldon is a dualtype W...
2,https://bulbapedia.bulbagarden.net/wiki/Arctoz...,Arctozolt,Arctozolt (Japanese: パッチルドン Patchilldon) is a ...,Arctozolt Japanese Patchilldon is a dualtype ...
3,https://bulbapedia.bulbagarden.net/wiki/Arroku...,Arrokuda,Arrokuda (Japanese: サシカマス Sasikamasu) is a Wat...,Arrokuda Japanese Sasikamasu is a Watertype P...
4,https://bulbapedia.bulbagarden.net/wiki/Ariado...,Ariados,Ariados (Japanese: アリアドス Ariados) is a dual-ty...,Ariados Japanese Ariados is a dualtype BugPoi...
...,...,...,...,...
1016,https://bulbapedia.bulbagarden.net/wiki/Aegisl...,Aegislash,Aegislash (Japanese: ギルガルド Gillgard) is a dual...,Aegislash Japanese Gillgard is a dualtype Ste...
1017,https://bulbapedia.bulbagarden.net/wiki/Aggron...,Aggron,Aggron (Japanese: ボスゴドラ Bossgodora) is a dual-...,Aggron Japanese Bossgodora is a dualtype Stee...
1018,https://bulbapedia.bulbagarden.net/wiki/Aeroda...,Aerodactyl,Aerodactyl (Japanese: プテラ Ptera) is a dual-typ...,Aerodactyl Japanese Ptera is a dualtype RockF...
1019,https://bulbapedia.bulbagarden.net/wiki/Abra_(...,Abra,Abra (Japanese: ケーシィ Casey) is a Psychic-type ...,Abra Japanese Casey is a Psychictype Pokmon i...


In [9]:
preprocessed_df.to_csv("pokemons_preprocessed.csv")

In [10]:
# saving project files
!zip -r /content/ir_project_1.zip /content/ir_project1_wiki_recommender
from google.colab import files
files.download("/content/ir_project_1.zip")

  adding: content/ir_project1_wiki_recommender/ (stored 0%)
  adding: content/ir_project1_wiki_recommender/output.json (deflated 67%)
  adding: content/ir_project1_wiki_recommender/pokemons_scrapped.csv (deflated 66%)
  adding: content/ir_project1_wiki_recommender/ir_project1_wiki_recommender/ (stored 0%)
  adding: content/ir_project1_wiki_recommender/ir_project1_wiki_recommender/items.py (deflated 37%)
  adding: content/ir_project1_wiki_recommender/ir_project1_wiki_recommender/middlewares.py (deflated 72%)
  adding: content/ir_project1_wiki_recommender/ir_project1_wiki_recommender/spiders/ (stored 0%)
  adding: content/ir_project1_wiki_recommender/ir_project1_wiki_recommender/spiders/pokemon_spider.py (deflated 64%)
  adding: content/ir_project1_wiki_recommender/ir_project1_wiki_recommender/spiders/__init__.py (deflated 27%)
  adding: content/ir_project1_wiki_recommender/ir_project1_wiki_recommender/spiders/.ipynb_checkpoints/ (stored 0%)
  adding: content/ir_project1_wiki_recommender


# Recommendation

For the recommendation system, we use sklearn's `TfidfVectorizer` to calculate the TF-IDF feature vectors of the documents. It does the following:
- lowercases all words - this way differences in capitalization do not affect the algorithm
- removes English stop words - stop words such as "the", "and", "a", etc. are irrelevant to this task, so we remove them for better performance and accuracy
- with `max_df=0.9, min_df=1`, respectively:
  - removes words that appear in more than 90% of documents - this is intended to cut off corpus(pokemon)-specific stop words
  - keeps every word that appears in at least one document - `min_df=1` is the default and filters nothing; dropping words that appear in only a single document (which do not help in scoring documents against each other) would require `min_df=2`
- tokenization - splits the text into individual words
- counting term frequencies (TF) - counts the number of times each term (word) appears in a document
- calculating inverse document frequency (IDF) - IDF measures how rare, and therefore how informative, a term is across the entire corpus
Once we obtain the TF-IDF vectors, we compute the cosine similarity between all pairs of documents. Then, given the pokemons to base the recommendation on, we obtain the recommendations as follows:
- let n be the number of top articles the user wants recommended
- let "rest of pokemons" be all the pokemons except the ones we base the recommendation on
- for each of the rest of pokemons, obtain the individual scores (cosine similarities) to all pokemons we base the recommendation on
- calculate the combined score (similarity) using all individual scores - this is a simple mean of all individual scores
  - **NOTE** - the individual scores must be normalized to ensure that every pokemon we base the recommendation on contributes equally to the combined score, regardless of the actual individual score values - see [normalizing the score](#normalizing_combined) and `get_pokemon_recommendations_normalized`, where the individual scores are normalized
- select the top n articles according to the combined score - this is our recommendation
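
The selection procedure can be sketched on a small hand-written similarity matrix. This shows the plain-mean variant only, without the normalization mentioned in the NOTE; the `recommend` helper, the names A to D, and the similarity values are made up for illustration.

```python
def recommend(similarity, names, liked, top_n=2):
    """Rank the remaining items by their mean similarity to the liked items."""
    liked_idx = [names.index(name) for name in liked]
    # "rest of pokemons": everything we do not base the recommendation on
    candidates = [i for i in range(len(names)) if i not in liked_idx]
    combined = {
        names[i]: sum(similarity[i][j] for j in liked_idx) / len(liked_idx)
        for i in candidates
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:top_n]


names = ["A", "B", "C", "D"]
similarity = [  # symmetric, with 1.0 on the diagonal, like a cosine similarity matrix
    [1.0, 0.2, 0.6, 0.1],
    [0.2, 1.0, 0.4, 0.3],
    [0.6, 0.4, 1.0, 0.5],
    [0.1, 0.3, 0.5, 1.0],
]

top = recommend(similarity, names, liked=["A", "B"], top_n=2)
print([name for name, _ in top])  # ['C', 'D']
```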

<a name="loading_preprocessed"></a>
## NOTE - Loading scraped and preprocessed data
If you want to use the recommender without running the scraper, you may load the preprocessed data from the CSV file (available on GitHub):

In [12]:
import pandas as pd
preprocessed_df = pd.read_csv('pokemons_preprocessed.csv')  # upload to runtime
display(preprocessed_df)

Unnamed: 0.1,Unnamed: 0,url,name,text,alphanumeric
0,0,https://bulbapedia.bulbagarden.net/wiki/Abomas...,Abomasnow,Abomasnow (Japanese: ユキノオー Yukinooh) is a dual...,Abomasnow Japanese Yukinooh is a dualtype Gra...
1,1,https://bulbapedia.bulbagarden.net/wiki/Arctov...,Arctovish,Arctovish (Japanese: ウオチルドン Uochilldon) is a d...,Arctovish Japanese Uochilldon is a dualtype W...
2,2,https://bulbapedia.bulbagarden.net/wiki/Arctoz...,Arctozolt,Arctozolt (Japanese: パッチルドン Patchilldon) is a ...,Arctozolt Japanese Patchilldon is a dualtype ...
3,3,https://bulbapedia.bulbagarden.net/wiki/Arroku...,Arrokuda,Arrokuda (Japanese: サシカマス Sasikamasu) is a Wat...,Arrokuda Japanese Sasikamasu is a Watertype P...
4,4,https://bulbapedia.bulbagarden.net/wiki/Ariado...,Ariados,Ariados (Japanese: アリアドス Ariados) is a dual-ty...,Ariados Japanese Ariados is a dualtype BugPoi...
...,...,...,...,...,...
1016,1016,https://bulbapedia.bulbagarden.net/wiki/Aegisl...,Aegislash,Aegislash (Japanese: ギルガルド Gillgard) is a dual...,Aegislash Japanese Gillgard is a dualtype Ste...
1017,1017,https://bulbapedia.bulbagarden.net/wiki/Aggron...,Aggron,Aggron (Japanese: ボスゴドラ Bossgodora) is a dual-...,Aggron Japanese Bossgodora is a dualtype Stee...
1018,1018,https://bulbapedia.bulbagarden.net/wiki/Aeroda...,Aerodactyl,Aerodactyl (Japanese: プテラ Ptera) is a dual-typ...,Aerodactyl Japanese Ptera is a dualtype RockF...
1019,1019,https://bulbapedia.bulbagarden.net/wiki/Abra_(...,Abra,Abra (Japanese: ケーシィ Casey) is a Psychic-type ...,Abra Japanese Casey is a Psychictype Pokmon i...


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def display_all_pokemons():
  """
  Displays all pokemons and links to their respective wiki pages
  """
  display(preprocessed_df[['name', 'url']])


def get_index_of_pokemon(pokemon: str):
  try:
    return preprocessed_df[preprocessed_df['name'] == pokemon].index[0]
  except IndexError:
    print(f"{pokemon} not found in the database. Please check the available pokemons using display_all_pokemons() function.")
    return None


def get_pokemon_name(index: int):
  try:
    return preprocessed_df.loc[index, 'name']
  except KeyError:
    print(f"Pokemon at index {index} not found in the database. Please check the available pokemons using display_all_pokemons() function.")
    return None


def pretty_print_similarity(similarity: float):
  return f'{similarity*100:.2f}%'


# max_df=0.9 - try to cut out corpus-specific stop words; min_df=1 (the default) keeps all remaining terms
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.9, min_df=1)
document_vectors = vectorizer.fit_transform(preprocessed_df['alphanumeric'])

cosine_sim_matrix = cosine_similarity(document_vectors, document_vectors)
pd.DataFrame(cosine_sim_matrix).to_csv('cosine_sim_matrix.csv')

all_pokemons = set(preprocessed_df['name'])
all_pokemons_indices = {get_index_of_pokemon(pokemon) for pokemon in all_pokemons}


def get_pokemon_recommendations(pokemons: set, top_n_recommendations: int = 5):
  """
  Given a set of pokemons from https://bulbapedia.bulbagarden.net/wiki/
  the user already read about, recommends top n similar pokemons to read about
  next
  """
  valid_pokemons = all_pokemons.intersection(pokemons)
  for pokemon in pokemons:
    if pokemon not in valid_pokemons:
      print(f'Pokemon "{pokemon}" not found in bulbapedia.\nYou may view all available pokemons by using display_all_pokemons() function.')
  if len(valid_pokemons) == 0:
    print('No valid pokemons to base the recommendation on. Please check the available pokemons using display_all_pokemons() function, adjust the pokemons, and try again.')
    return
  print(f'Valid pokemons to base the recommendation on: {valid_pokemons}')

  valid_pokemons_indexes = {get_index_of_pokemon(pokemon) for pokemon in valid_pokemons}
  pokemons_to_compare_indexes = all_pokemons_indices - valid_pokemons_indexes

  # calculate individual scores
  individual_scores = dict()
  for pokemon in pokemons_to_compare_indexes:
    individual_scores[pokemon] = dict()
    for input_pokemon in valid_pokemons_indexes:
      individual_scores[pokemon][input_pokemon] = cosine_sim_matrix[pokemon][input_pokemon]

  # calculate combined score - simple mean of all scores
  combined_scores = {pokemon: sum(individual_scores[pokemon].values())/len(valid_pokemons_indexes) for pokemon in pokemons_to_compare_indexes}

  # select top n
  sorted_scores = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)[:top_n_recommendations]

  # display top n with combined and individual scores, plus a snippet (first 200 chars) of the raw text
  print(f'\nTop {top_n_recommendations} recommendations for pokemons to read about next, given you already read about {valid_pokemons}:')
  for recommended_pokemon_index, recommended_pokemon_combined_score in sorted_scores:
    print(f'***** *** Pokemon: {get_pokemon_name(recommended_pokemon_index)}')
    print(f'Similarity to all pokemons you read about: {pretty_print_similarity(recommended_pokemon_combined_score)}')
    print('Similarity to these pokemons individually:')
    for pokemon_index in valid_pokemons_indexes:
      print(f'- {get_pokemon_name(pokemon_index)}: {pretty_print_similarity(individual_scores[recommended_pokemon_index][pokemon_index])}')
    print(f'Snippet of the article on bulbapedia:\n{preprocessed_df.iloc[recommended_pokemon_index]["text"][:200]}...')
    print(f'Read the article here - url: {preprocessed_df.iloc[recommended_pokemon_index]["url"]}')
    print()

  return {get_pokemon_name(index): score for index, score in sorted_scores}

<a name="normalizing_combined"></a>
The problem with this approach is that the individual scores are not normalized, so a pokemon with relatively high similarities can dominate one with relatively low similarities and throw the recommendations off.

Let A, B, C, D, E be all pokemons. Consider the following example scores for pokemons C, D, E, given we base the recommendation on A, B:
- pokemon C - A: 0.74, B: 0.08
- pokemon D - A: 0.32, B: 0.32
- pokemon E - A: 0.24, B: 0.24

The scores would be as follows:
- pokemon C - 0.41
- pokemon D - 0.32
- pokemon E - 0.24

Pokemon C would rank first, even though pokemons D and E score 4 and 3 times higher than C on pokemon B. If we normalize by dividing each individual score by the highest score for the same input pokemon (rounded to :.2f precision):
- pokemon C - A: 1.00, B: 0.25
- pokemon D - A: 0.43, B: 1.00
- pokemon E - A: 0.32, B: 0.75

The scores are as follows:
- pokemon C - 0.625
- pokemon D - 0.715
- pokemon E - 0.535

Pokemon D now ranks first, and pokemon E falls much less short of pokemon C. Normalizing the scores makes the recommendation fairer, because all individual scores contribute equally.
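
The worked example can be reproduced with a few lines of Python. This is a minimal sketch of the max-based normalization using the example scores from the text; the variable names are ours. The combined values for D and E differ slightly in the third decimal from the figures in the text, because here the normalized scores are not rounded before averaging.

```python
# Individual similarities of candidates C, D, E to the input pokemons A and B
scores = {
    "C": {"A": 0.74, "B": 0.08},
    "D": {"A": 0.32, "B": 0.32},
    "E": {"A": 0.24, "B": 0.24},
}
inputs = ["A", "B"]

# Highest score reached for each input pokemon
max_per_input = {i: max(s[i] for s in scores.values()) for i in inputs}

# Divide every score by the maximum for the same input, so the best match gets 1.0
normalized = {
    candidate: {i: s[i] / max_per_input[i] for i in inputs}
    for candidate, s in scores.items()
}

# Combined score: simple mean of the normalized scores
combined = {c: sum(v.values()) / len(inputs) for c, v in normalized.items()}
ranking = sorted(combined, key=combined.get, reverse=True)

print(ranking)  # ['D', 'C', 'E']
print({c: round(v, 3) for c, v in combined.items()})
```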

In [45]:
def get_pokemon_recommendations_normalized(pokemons: set, top_n_recommendations: int = 5):
  """
  Given a set of pokemons from https://bulbapedia.bulbagarden.net/wiki/
  the user already read about, recommends top n similar pokemons to read about
  next
  """
  valid_pokemons = all_pokemons.intersection(pokemons)
  for pokemon in pokemons:
    if pokemon not in valid_pokemons:
      print(f'Pokemon "{pokemon}" not found in bulbapedia.\nYou may view all available pokemons by using display_all_pokemons() function.')
  if len(valid_pokemons) == 0:
    print('No valid pokemons to base the recommendation on. Please check the available pokemons using display_all_pokemons() function, adjust the pokemons, and try again.')
    return
  print(f'Valid pokemons to base the recommendation on: {valid_pokemons}')

  valid_pokemons_indexes = {get_index_of_pokemon(pokemon) for pokemon in valid_pokemons}
  pokemons_to_compare_indexes = all_pokemons_indices - valid_pokemons_indexes

  # calculate individual scores
  individual_scores = dict()
  for pokemon in pokemons_to_compare_indexes:
    individual_scores[pokemon] = dict()
    for input_pokemon in valid_pokemons_indexes:
      individual_scores[pokemon][input_pokemon] = cosine_sim_matrix[pokemon][input_pokemon]

  # normalize - divide each input pokemon's column by its maximum, so the best match for that pokemon scores 1.0
  scores_df = pd.DataFrame.from_dict(individual_scores, orient='index')
  for pokemon in valid_pokemons_indexes:
    max_value = scores_df[pokemon].max()
    scores_df[pokemon] = scores_df[pokemon] / max_value

  # calculate combined score - simple mean of all scores
  scores_df['combined'] = scores_df.mean(axis=1)

  # select top n
  sorted_df = scores_df.sort_values(by='combined', ascending=False)  # avoid shadowing the built-in sorted()
  selected = sorted_df.head(top_n_recommendations)

  # display top n with combined (normalized) and individual (raw) scores, plus a snippet (first 200 chars) of the raw text
  print(f'\nTop {top_n_recommendations} recommendations for pokemons to read about next, given you already read about {valid_pokemons}:\n')
  for recommended_pokemon_index, row in selected.iterrows():
    print(f'***** *** Pokemon: {get_pokemon_name(recommended_pokemon_index)}')
    print(f'Combined score given all pokemons you read about: {pretty_print_similarity(row["combined"])}')
    print('Similarity to these pokemons individually:')
    for pokemon_index in valid_pokemons_indexes:
      print(f'- {get_pokemon_name(pokemon_index)}: {pretty_print_similarity(individual_scores[recommended_pokemon_index][pokemon_index])}')
    print(f'Snippet of the article on bulbapedia:\n{preprocessed_df.iloc[recommended_pokemon_index]["text"][:200]}...')
    print(f'Read the article here - url: {preprocessed_df.iloc[recommended_pokemon_index]["url"]}')
    print()

  return selected

# Using the recommender

## Search all pokemons

In [17]:
print("NOTE - in colab, useful to Convert to interactive table")
display_all_pokemons()

NOTE - in colab, useful to Convert to interactive table


Unnamed: 0,name,url
0,Abomasnow,https://bulbapedia.bulbagarden.net/wiki/Abomas...
1,Armarouge,https://bulbapedia.bulbagarden.net/wiki/Armaro...
2,Arrokuda,https://bulbapedia.bulbagarden.net/wiki/Arroku...
3,Aurorus,https://bulbapedia.bulbagarden.net/wiki/Auroru...
4,Aromatisse,https://bulbapedia.bulbagarden.net/wiki/Aromat...
...,...,...
1016,Aggron,https://bulbapedia.bulbagarden.net/wiki/Aggron...
1017,Aegislash,https://bulbapedia.bulbagarden.net/wiki/Aegisl...
1018,Aipom,https://bulbapedia.bulbagarden.net/wiki/Aipom_...
1019,Accelgor,https://bulbapedia.bulbagarden.net/wiki/Accelg...


## Use the recommender system

In [54]:
# Adjust the set of pokemons you want the recommendations to be based on
pokemons = {
    'Pikachu',
    'Geodude',
    'Diglett'
}
# Adjust the number of pokemons to be recommended
pokemons_to_recommend = 5

recommendations = get_pokemon_recommendations_normalized(pokemons, pokemons_to_recommend)

Valid pokemons to base the recommendation on: {'Diglett', 'Pikachu', 'Geodude'}

Top 5 recommendations for pokemons to read about next, given you already read about {'Diglett', 'Pikachu', 'Geodude'}:

***** *** Pokemon: Graveler
Combined score given all pokemons you read about: 44.31%
Similarity to these pokemons individually:
- Geodude: 18.07%
- Pikachu: 3.85%
- Diglett: 4.55%
Snippet of the article on bulbapedia:
Graveler (Japanese: ゴローン Golone) is a dual-type Rock/Ground Pokémon introduced in Generation I.
 It evolves from Geodude starting at level 25 and evolves into Golem when traded or when exposed to a Li...
Read the article here - url: https://bulbapedia.bulbagarden.net/wiki/Graveler_(Pok%C3%A9mon)

***** *** Pokemon: Golem
Combined score given all pokemons you read about: 43.39%
Similarity to these pokemons individually:
- Geodude: 18.71%
- Pikachu: 3.16%
- Diglett: 3.82%
Snippet of the article on bulbapedia:
Golem (Japanese: ゴローニャ Golonya) is a dual-type Rock/Ground Pokémon i

## Evaluation of the system - expert view

TODO

In [58]:
get_pokemon_recommendations_normalized({'Chikorita', 'Bulbasaur', 'Tepig'}, 10)

Valid pokemons to base the recommendation on: {'Tepig', 'Chikorita', 'Bulbasaur'}

Top 10 recommendations for pokemons to read about next, given you already read about {'Tepig', 'Chikorita', 'Bulbasaur'}:

***** *** Pokemon: Bayleef
Combined score given all pokemons you read about: 37.35%
Similarity to these pokemons individually:
- Chikorita: 44.51%
- Bulbasaur: 2.83%
- Tepig: 1.64%
Snippet of the article on bulbapedia:
Bayleef (Japanese: ベイリーフ Bayleaf) is a Grass-type Pokémon introduced in Generation II.
 It evolves from Chikorita starting at level 16 and evolves into Meganium starting at level 32.
 Bayleef is a qua...
Read the article here - url: https://bulbapedia.bulbagarden.net/wiki/Bayleef_(Pok%C3%A9mon)

***** *** Pokemon: Ivysaur
Combined score given all pokemons you read about: 37.10%
Similarity to these pokemons individually:
- Chikorita: 2.56%
- Bulbasaur: 37.00%
- Tepig: 2.06%
Snippet of the article on bulbapedia:
Ivysaur (Japanese: フシギソウ Fushigisou) is a dual-type Grass/P

Unnamed: 0,902,942,262,combined
2,1.0,0.076412,0.04397,0.373461
655,0.057609,1.0,0.05531,0.370973
455,0.037528,0.043995,1.0,0.360508
330,0.111692,0.124088,0.5972,0.27766
805,0.032866,0.050465,0.748869,0.2774
198,0.056813,0.61027,0.039791,0.235625
492,0.080262,0.120474,0.460287,0.220341
236,0.35826,0.181711,0.096134,0.212035
310,0.092154,0.427223,0.09061,0.203329
27,0.327192,0.149618,0.10361,0.193473


In [55]:
get_pokemon_recommendations_normalized({'Pikachu'}, 10)

Valid pokemons to base the recommendation on: {'Pikachu'}

Top 10 recommendations for pokemons to read about next, given you already read about {'Pikachu'}:

***** *** Pokemon: Raichu
Combined score given all pokemons you read about: 100.00%
Similarity to these pokemons individually:
- Pikachu: 20.24%
Snippet of the article on bulbapedia:
Raichu (Japanese: ライチュウ Raichu) is an Electric-type Pokémon introduced in Generation I.
 It evolves from Pikachu when exposed to a Thunder Stone. It is the final form of Pichu.
 In Alola, Raichu has a...
Read the article here - url: https://bulbapedia.bulbagarden.net/wiki/Raichu_(Pok%C3%A9mon)

***** *** Pokemon: Pichu
Combined score given all pokemons you read about: 59.78%
Similarity to these pokemons individually:
- Pikachu: 12.10%
Snippet of the article on bulbapedia:
Pichu (Japanese: ピチュー Pichu) is an Electric-type baby Pokémon introduced in Generation II.
 It evolves into Pikachu when leveled up with high friendship, which evolves into Raichu wh

Unnamed: 0,458,combined
414,1.0,1.0
466,0.597777,0.597777
870,0.397693,0.397693
492,0.392383,0.392383
817,0.384781,0.384781
918,0.373152,0.373152
537,0.373059,0.373059
330,0.351129,0.351129
524,0.340688,0.340688
545,0.339208,0.339208


In [56]:
get_pokemon_recommendations_normalized({'Pichu', 'Pikachu', 'Raichu'}, 10)

Valid pokemons to base the recommendation on: {'Raichu', 'Pikachu', 'Pichu'}

Top 10 recommendations for pokemons to read about next, given you already read about {'Raichu', 'Pikachu', 'Pichu'}:

***** *** Pokemon: Wynaut
Combined score given all pokemons you read about: 63.77%
Similarity to these pokemons individually:
- Pichu: 10.47%
- Pikachu: 5.32%
- Raichu: 1.69%
Snippet of the article on bulbapedia:
Wynaut (Japanese: ソーナノ Sohnano) is a Psychic-type baby Pokémon introduced in Generation III.
 It evolves into Wobbuffet starting at level 15.
 Wynaut is a small, bipedal Pokémon covered in blue fur. I...
Read the article here - url: https://bulbapedia.bulbagarden.net/wiki/Wynaut_(Pok%C3%A9mon)

***** *** Pokemon: Exeggutor
Combined score given all pokemons you read about: 60.16%
Similarity to these pokemons individually:
- Pichu: 2.25%
- Pikachu: 4.75%
- Raichu: 6.68%
Snippet of the article on bulbapedia:
Exeggutor (Japanese: ナッシー Nassy) is a dual-type Grass/Psychic Pokémon introduced

Unnamed: 0,466,458,414,combined
131,1.0,0.660281,0.252811,0.637697
797,0.214865,0.589984,1.0,0.601616
545,0.268184,0.852938,0.672662,0.597928
817,0.342692,0.967532,0.418427,0.576217
902,0.62593,0.746032,0.287602,0.553188
236,0.5107,0.80354,0.313167,0.542469
310,0.28665,0.712425,0.625439,0.541505
892,0.376256,0.785435,0.419317,0.527002
28,0.428088,0.587593,0.558332,0.524671
492,0.271096,0.986646,0.313779,0.52384


# Interesting properties of the data

In [None]:
# todo - (most frequent words, histograms, similarities between documents, ...))

In [21]:
# todo - very big matrix -> takes a lot of time, keeping for reference

!pip install fastcluster
import seaborn as sns
import matplotlib.pyplot as plt

# tick labels come from preprocessed_df, which (unlike df) is also defined when the data is loaded from the CSV
g = sns.clustermap(cosine_sim_matrix, cmap='viridis', annot=True, fmt=".2f", xticklabels=preprocessed_df.index, yticklabels=preprocessed_df.index)
plt.title('Cosine Similarity Matrix')
plt.setp(g.ax_heatmap.get_yticklabels(), rotation=0)
plt.show()

Collecting fastcluster
  Downloading fastcluster-1.2.6-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
ERROR: Operation cancelled by user
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 3108, in _dep_map
    return self.__dep_map
  File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 2901, in __getattr__
    raise AttributeError(attr)
AttributeError: _DistInfoDistribution__dep_map

During handling of the above exception, ano

NameError: ignored