<div class="alert alert-info">
    
➡️ Before you start, make sure that you are familiar with the **[study guide](https://liu-nlp.ai/text-mining/logistics/)**, in particular the rules around **cheating and plagiarism** (found in the course memo).

➡️ If you use code from external sources (e.g. StackOverflow, ChatGPT, ...) as part of your solutions, don't forget to add a reference to these source(s) (for example as a comment above your code).

➡️ Make sure you fill in all cells that say **`YOUR CODE HERE`** or **YOUR ANSWER HERE**.  You normally shouldn't need to modify any of the other cells.

</div>

# L1: Information Retrieval

In this lab you will apply basic techniques from information retrieval to implement the core of a minimalistic search engine. The data for this lab consists of a collection of app descriptions scraped from the [Google Play Store](https://play.google.com/store/apps?hl=en). From this collection, your search engine should retrieve those apps whose descriptions best match a given query under the vector space model.

In [51]:
# Define some helper functions that are used in this notebook
from IPython.display import display, HTML

def success():
    display(HTML('<div class="alert alert-success"><strong>Checks have passed!</strong></div>'))

## Dataset

The app descriptions come in the form of a compressed [JSON](https://en.wikipedia.org/wiki/JSON) file. Start by loading this file into a [Pandas](https://pandas.pydata.org) [DataFrame](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe).

In [52]:
import bz2
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', 500)

with bz2.open('app-descriptions.json.bz2', mode='rt', encoding='utf-8') as source:
    df = pd.read_json(source, encoding='utf-8')

In Pandas, a DataFrame is a table with indexed rows and labelled columns of potentially different types. You can access data in a DataFrame in various ways, including by row and column. To give an example, the code in the next cell shows rows 200–205 (label-based slicing with `.loc` includes both endpoints):

In [53]:
df.loc[200:205]

Unnamed: 0,name,description
200,Brick Breaker Star: Space King,"Introducing the best Brick Breaker game that everyone can enjoy.\nEnjoy various missions and addictively simple play control.\n\n[Features]\n- Hundreds of stages and various missions\n- No limit to play such as Heart, play as much as you can!\n- 5 kinds of various items and items reinforcement system\n- No network required\n- game file is as low as 20M, light-weight download!\n- supports tablet screen\n- supports Google Play Leaderboards, Achievement, Multiplay\n- supports 14 languages\n\nHo..."
201,Brick Classic - Brick Game,"Classic Brick Game!\n\nBrick Classic is a popular and addictive puzzle game!\n\nHow to play?\n- Simply drag the bricks to move them.\n- Create full lines on the grid vertically or horizontally to break bricks.\n\nTips:\n- Classic brick game without time limits.\n- Place the bricks in a reasonable position.\n- The more brick break, the more scores you have.\n- Bricks can't be rotated.\n\nWho's the best brick breaker? Challenge it now!!!"
202,Bricks Breaker - Glow Balls,"Bricks Breaker - Glow Balls is a addictive and challenging brick game.\nJust play it to relax your brain. Be focus on breaking bricks and you will find it more funny and exciting.\n\nHow to play\n- Hold the screen with your finger and move to aim.\n- Find best positions and angles to hit all bricks.\n- When the durability of brick reaches 0, destroyed.\n- Never let bricks reach the bottom or game is over.\n\nFeatures\n- Colorful glow skins.\n- Free to play.\n- Easy game controls with one fin..."
203,Bricks Breaker Quest,"How to play\n- The ball flies to wherever you touched.\n- Clear the stages by removing bricks on the board.\n- Break the bricks and never let them hit the bottom.\n- Find best positions and angles to hit every brick.\n\nFeature\n- Free to play\n- Tons of stages\n- Various types of balls\n- Easy to play, Simplest game system, Designed for one handheld gameplay.\n- Off-line (without internet connection) gameplay supported \n- Multi-play supported\n- Tablet device supported\n- Achievement & lea..."
204,Brothers in Arms® 3,"Fight brave soldiers from around the globe on the frenzied multiplayer battlegrounds of World War 2 or become Sergeant Wright and experience a dramatic, life-changing single-player journey, in the aftermath of the D-Day invasion.\n\nCLIMB THE ARMY RANKS IN MULTIPLAYER \n> 4 maps to master and enjoy. \n> 2 gameplay modes to begin with: Free For All and Team Deathmatch.\n> Unlock game-changing perks by playing with each weapon class!\n> A soldier’s only as deadly as his weapon. Be sure to upgr..."
205,Brown Dust - Tactical RPG,"The Empire has fallen, and the Age of Great Mercenaries Now Begins!\nCreate Your Ultimate Team And Strike Down Your Enemies!\n\nCAPTIVATING AND STUNNING ARTWORK\n- Experience the high-quality anime illustrations you have never seen before.\n- Meet Brown Dust's charming Mercenaries now.\n\nASSEMBLE LEGENDARY MERCENARIES\n- Over 300 Mercenaries and a Variety of Skills.\n- Discover the Unique Mercenaries, 6 Devils and Dominus Octo.\n- All Mercenaries can reach max level and the highest rank.\n\..."


As you can see, there are two labelled columns: `name` (the name of the app) and `description` (a textual description). The next cell shows how to access only the description field from row 200:

In [54]:
df.loc[200, 'description']

'Introducing the best Brick Breaker game that everyone can enjoy.\nEnjoy various missions and addictively simple play control.\n\n[Features]\n- Hundreds of stages and various missions\n- No limit to play such as Heart, play as much as you can!\n- 5 kinds of various items and items reinforcement system\n- No network required\n- game file is as low as 20M, light-weight download!\n- supports tablet screen\n- supports Google Play Leaderboards, Achievement, Multiplay\n- supports 14 languages\n\nHomepage:\nhttps://play.google.com/store/apps/dev?id=4931745640662708567\n\nFacebook: \nhttps://www.facebook.com/spcomesgames/'
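The difference between label-based and position-based access is easy to trip over, so here is a minimal sketch on a made-up table (the `toy` DataFrame and its contents are hypothetical, not part of the lab data). It illustrates that `.loc` slices by row label and includes both endpoints, while `.iloc` slices by position and excludes the end.

```python
import pandas as pd

# Hypothetical toy table; the labels 200..203 mimic the row labels used above.
toy = pd.DataFrame(
    {"name": ["A", "B", "C", "D"], "description": ["a", "b", "c", "d"]},
    index=[200, 201, 202, 203],
)

by_label = toy.loc[200:202]      # label-based: both endpoints are included
by_position = toy.iloc[0:2]      # position-based: the end is excluded
single_cell = toy.loc[201, "description"]  # one row label, one column label

print(len(by_label), len(by_position), single_cell)
```

In the lab data the row labels happen to coincide with the row positions, but the two access styles stop being interchangeable as soon as a DataFrame is filtered or re-sorted.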

## Problem 1: What's in a vector?

We start by vectorising the data; more specifically, we map each app description to a tf–idf vector. This is very simple with a library like [scikit-learn](https://scikit-learn.org/stable/), which provides a [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class for exactly this purpose.  If we instantiate this class and call `fit_transform()` on all of our app descriptions, scikit-learn will preprocess and tokenize each app description, compute tf–idf values for each of them, and return a vectorised representation:

In [55]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['description'])
X

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 267110 stored elements and shape (1614, 27877)>
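To make the tf–idf weighting concrete, the following sketch computes the weights by hand for a tiny, made-up corpus. It assumes the smoothed idf formula and L2 normalisation that scikit-learn documents as the defaults of `TfidfVectorizer`; the corpus, the `tfidf` helper and all variable names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, pre-tokenised toy corpus (not taken from the lab data).
docs = [
    ["pile", "up", "pancakes"],
    ["pile", "up", "bricks"],
    ["break", "bricks"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each token occur?
doc_freq = Counter(token for doc in docs for token in set(doc))

def tfidf(doc):
    """Return an L2-normalised {token: weight} mapping for one document."""
    counts = Counter(doc)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1, multiplied by the raw term count
    raw = {
        tok: tf * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
        for tok, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {tok: w / norm for tok, w in raw.items()}

weights = tfidf(docs[0])
for tok, w in sorted(weights.items(), key=lambda item: item[1], reverse=True):
    print(f"{tok}: {w:.3f}")
```

The token that occurs in only one document ("pancakes") receives a higher weight than the tokens shared with other documents, which is the behaviour to look for in the real data as well.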

Let’s pick the app "Pancake Tower", which has a rather short description text, to see how it has been vectorised:

In [56]:
# We can use 'toarray' to convert the sparse matrix object into a "normal" array
vec = X[1032].toarray()[0]

# The app description & its corresponding vector
df.loc[1032, 'description'], vec

("Let's see how many pancakes you can pile up!!",
 array([0., 0., 0., ..., 0., 0., 0.], shape=(27877,)))

That's not very informative yet.  We know that the vector contains tf–idf values, and that each dimension of the vector corresponds to a token in the vectorizer’s vocabulary; let's extract these for this specific example.

Your **first task** is to find out how to access the `vectorizer`’s vocabulary, for example by [checking the documentation of `TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), and print all the tokens that are represented in the vector with a tf–idf value greater than zero (i.e., only the tokens that are actually part of this app’s description) _in descending order of the tf–idf values_.  In other words, the token with the highest tf–idf value should be at the top of your output, and the token with the lowest tf–idf value at the bottom.   Before you implement this, think about what you would expect the output to look like, for example which words you would expect to have the highest/lowest tf–idf values in this example.

Your final output should look something like this:

```
<token 1>: <tf-idf value 1>
<token 2>: <tf-idf value 2>
...
```

In [57]:
"""Print the tokens and their tf–idf values, in descending order."""

# YOUR CODE HERE
feature_names = vectorizer.get_feature_names_out()
print(f'feature names: {feature_names}')

dense = X[1032].toarray().flatten()
print(f'dense: {dense}')

mydf = pd.DataFrame({'term': feature_names, 'tfidf': dense})
mydf = mydf[mydf.tfidf > 0].sort_values('tfidf', ascending=False)
print(mydf)

feature names: ['00' '000' '0000' ... 'ﬁrst' 'ﬂip' 'ﬂying']
dense: [0. 0. 0. ... 0. 0. 0.]
           term     tfidf
15455  pancakes  0.653933
15954      pile  0.530470
12300       let  0.261529
18684       see  0.255763
13059      many  0.234920
10230       how  0.211532
22697        up  0.172168
3535        can  0.130476
24136       you  0.102769


## Problem 2: Finding the nearest vectors

To build a small search engine, we need to be able to turn _queries_ (for example the string "pile up pancakes") into _query vectors_, and then find out which of our app description vectors are closest to the query vector.

For the first part (turning queries into query vectors), we can simply re-use the `vectorizer` that we used for the app descriptions. For the second part, an easy way to find the closest vectors is to use scikit-learn’s [NearestNeighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html) class. This class needs to be _fit_ on a set of vectors (the "training set"; in our case the app descriptions) and can then be used with any vector to find its _nearest neighbors_ in the vector space.

**First,** instantiate and fit a class that returns the _ten (10)_ nearest neighbors:

In [58]:
"""Instantiate and fit a class that returns the 10 nearest neighboring vectors."""

# YOUR CODE HERE
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=10)
nn.fit(X=X)

parameter,value
n_neighbors,10
radius,1.0
algorithm,'auto'
leaf_size,30
metric,'minkowski'
p,2
metric_params,None
n_jobs,None
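The fitted class above uses the default Minkowski metric with `p=2`, i.e. Euclidean distance, whereas the task speaks of cosine similarity. For vectors of unit length the two give the same ranking, and `TfidfVectorizer` L2-normalises its output by default. The sketch below checks this numerically on random stand-in vectors (the data and variable names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for tf-idf rows: non-negative, scaled to unit length.
docs = rng.random((5, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.random(8)
query /= np.linalg.norm(query)

cosine_sim = docs @ query                         # dot product of unit vectors
euclidean = np.linalg.norm(docs - query, axis=1)

# For unit vectors: ||d - q||^2 = 2 - 2 * cos(d, q)
identity_holds = np.allclose(euclidean ** 2, 2 - 2 * cosine_sim)

by_distance = np.argsort(euclidean)       # nearest first
by_similarity = np.argsort(-cosine_sim)   # most similar first
print(identity_holds, by_distance, by_similarity)
```

Passing `metric='cosine'` to `NearestNeighbors` would make the choice explicit; with normalised tf–idf vectors the returned neighbours should be the same either way.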


**Second,** implement a function that uses the vectorizer and the fitted class to find the nearest neighbours for a given query string:

In [60]:
def search(query):
    """Find the nearest neighbors in `df` for a query string.

    Arguments:
      query (str): A query string.

    Returns:
      The 10 apps (with name and description) most similar (in terms of
      cosine similarity) to the given query as a Pandas DataFrame.
    """
    
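    # YOUR CODE HERE
    # Vectorise the query with the vocabulary learned from the app descriptions
    query_vec = vectorizer.transform([query])
    # kneighbors returns positional row indices, nearest first; since the tf-idf
    # rows are L2-normalised by default, Euclidean order matches cosine order
    neighbor_idx = nn.kneighbors(query_vec, return_distance=False)[0]
    return df.iloc[neighbor_idx]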



### 🤞 Test your code

Test your implementation by running the following cell, which will sanity-check your return value and show the 10 best search results for the query _"pile up pancakes"_:

In [61]:
"""Check that searching for "pile up pancakes" returns a DataFrame with ten results,
   and that the top result is "Pancake Tower"."""

result = search('pile up pancakes')
display(result)
assert isinstance(result, pd.DataFrame), "search() function should return a Pandas DataFrame"
assert len(result) == 10, "search() function should return 10 search results"
assert result.iloc[0]["name"] == "Pancake Tower", "Top search result should be 'Pancake Tower'"
success()

Unnamed: 0,name,description
1032,Pancake Tower,Let's see how many pancakes you can pile up!!
326,Cooking School: Games for Girls,"Children like to help their parents. They especially like to help with cooking . When there is a cooking in the kitchen, it is no way to play. But cooking is a complicated process and often it ends up with a huge mess in the kitchen. But what if you are so eager to cook pancakes, cake or cupcakes? How to cook all that without doing a cleaning after? We have a solution! Home Cooking School with our curious Hippo has opened especially for parents and children! We do not only cook food here. We..."
656,"Hell’s Cooking — crazy chef burger, kitchen fever","⭐ ⭐ ⭐ ⭐ ⭐ New world of crazy cooking is here. Feel what it means to be a master chef who prepares fantastic fast food in a prominent king kitchen! If you haven't ever tried yourself as a hamburger chef cook, it's possibly the best time for making diner. Download and launch Hell's Cooking — crazy chef burger, kitchen fever HD game and get prepared to jump into a fever and adventurous perfect world of burgers.\n\nNew girls game Hell's Cooking gives you lots of opportunities for your crazy cafe..."
1235,Solitaire,"Solitaire Free by Solitaire Card Games is the #1 klondike solitaire games on android. The solitaire Free is popular and classic card games you know and love.\n\nWe carefully designed a fresh solitaire free modern look, woven into the wonderful solitaire classic feel that everyone loves. \n\nExperience the crisp, clear, and easy to read cards, simple and quick animations, and subtle sounds, in either landscape or portrait views. \n\nYou can move cards with a single tap or drag them to their d..."
1164,Rummy - Free,"Play the famous Rummy card game on your Android Smartphone or Tablet !! \n\nPlay rummy with 2, 3, or 4 players against simulated opponents playing with high-level artificial intelligence. \nThere are a number of rules that can be modified, making this game very faithful to the original. \n\n*** MANY VARIATIONS INCLUDED *** \n\nMany rummy variations are included in the application: \n\n- From 2 to 4 players. \n- Choose the AI level of opponents. \n- Number of cards dealt to each player (from ..."
1181,Sago Mini Trucks and Diggers,"Drive a dump truck with Rosie the hamster! Pile dirt high and dig deep in the ground with diggers, cranes and bulldozers. Build a home for a new friend! Choose a barn, a castle or even a cupcake-house. Don’t forget to add the finishing touches for the proud owner.\n\nOn this construction site, kids love being the boss. With six mighty machines and piles of dirt, you can build all day! Part of the award-winning suite of Sago Mini apps, this app puts kids in charge.\n\nSago Mini apps have no i..."
436,Dr. Panda's Ice Cream Truck,"Chocolate? Vanilla? Strawberry? All three!? You decide! In Dr. Panda’s Ice Cream Truck you can mix up all sorts of different flavors with cookies, chocolate, nuts and more to make the perfect ice cream—hundreds of combinations in all.\n\nScoop it!\nThese animals love ice cream, and will eat as much (or little) as you want to serve them. You can make scoops big or small and pile them as high as you want—using any of the ice cream you’ve created!\n\nToppings galore!\nUse chocolate syrup, cooki..."
1442,Turbo Dismount™,"The legendary crash simulator is now on Google Play!\n\nPerform death-defying motor stunts, crash into walls, create traffic pile-ups of epic scale - and share the fun!\n\nTurbo Dismount™ is a kinetic tragedy about Mr. Dismount and the cars who love him. It is the official sequel to the wildly popular and immensely successful personal impact simulator - Stair Dismount™. \n\nFEATURES:\n* Flinch-inducing crash physics\n* Crunchy sound effects\n* Delicious slow-mo replay system\n* Multiple vehi..."
1446,UNO!™,"Play the world’s number one card game like never before. UNO!™ has all-new rules, tournaments, adventures and so much more! At home or on the move, jump into games instantly. Whether an UNO!™ veteran or completely new, take on challenges and reap the rewards. UNO!™ is the ultimate competitive family-friendly card game.\n- Play classic UNO!™ or use tons of popular house rules!\n- Connect anytime, anywhere with friends from around the world! \n- Two heads are better than one in 2v2 mode. Use t..."
1326,TO-FU Oh!SUSHI,"You are the veritable sushi master! Prepare your own fun sushi with “Daizu” the skunk!\n\nThis app is designed to allow children to be creative by decorating their original sushi.\n\nServe your delicious, mysterious or impossible sushi to the people of “Tofu Island”! \n\nHow about creating sushi that is totally original and serve it to your beloved guests? Spice it up with tons of wasabi or even sprinkle chocolate and gummy bears for those sweet lovers.\nFeel free to make any kind of sushi y..."


Before continuing with the next problem, play around a bit with this simple search functionality by trying out different search queries, and see if the results look like what you would expect:

In [68]:
# Example: try out your own queries!
search("christmas travel")

Unnamed: 0,name,description
661,Hey Duggee: The Tinsel Badge,"Hey Duggee: The Tinsel Badge is the brand new official app for fans of the show and it’s FREE!\n\nChristmas is coming but the Clubhouse isn’t looking very Christmassy! Help the Squirrels earn their Tinsel Badge by decorating Duggee’s Christmas tree. \n\nFeatures:\n\nUsing simple drag-and-drop, tapping and swiping motions:\n• Cover the tree in tinsel\n• Hang baubles, bow ties, snowflakes, stars, candy canes and more\n• Place a special Christmas Squirrel at the very top of the tree\n• Finally,..."
748,Kids Animals Jigsaw Puzzles ❤️🦄,"If your preschool kids like jigsaw puzzles, they will LOVE Super Puzzle!\n\nSuper Puzzle works almost like a real jigsaw puzzle for kids. When you select a puzzle piece it stays on the board even if you place it incorrectly, and you can move the puzzle piece until it slides to the correct place. \n\nEach relaxing puzzle features a beautiful image drawn by a professional artist and a unique reward when the jigsaw puzzle is completed. The images include things like unicorns, dragons, or dinosa..."
310,Coloring and drawing for kids,"Coloring and drawing for kids contains 128 coloring pages for toddlers to enjoy. Drawing app is perfect for girls and boys ages 2 to 8. Painting game helps babies to develop creativity, fine motor skills and hand eye coordination.\n\nDoodle game for kids features:\n\n- 128 coloring pages for toddlers.\n- Drawing app for kids features 8 themes: animals, princesses, cars, school, musical instruments, food, Halloween and Christmas.\n- Baby doodle games for kids with simple interface.\n- 16 colo..."
848,MAGICA TRAVEL AGENCY ‚Äì Free Match 3 Puzzle Game,"Travel agency ""Magica"" - can immerse you in an atmosphere of unforgettable adventures! The detailed worlds with unique puzzles.You will travel to new worlds on balloons and original gameplay will not leave indifferent any fan of the popular genre Match 3. Also you can play with your friends and share their lives with them. During the game obstacles will try to stop you, but it does not matter, as a rule, you just need to help a local resident and he will disrupt the obstacle and you can cont..."
1208,Shuriken Jump,Travel with the shuriken as high as possible! This jumping upwards game will test your reaction speed and hopping skills. You will be challenged to leap from beam to beam while slicing fruits.
311,Coloring book for kids,"Coloring book for kindergarten kids and toddlers. The app has 120 pictures for coloring that will keep your child entertained while developing creativity, fine motor skills and hand eye coordination. Our coloring game is great for both girls and boys of all ages and interests. It allows kids to color animals, dinosaurs, princesses, transport, aliens, sea creatures, robots and even Christmas pictures.\n\nDrawing game with different instruments – pencil, brush, spray, crayon, felt-tip pen and ..."
1357,The Enchanted Worlds,"In this latest adventure with Uncle Henry, he has had a secret kept for many years that he now wishes to share with you. Over his travels he has discovered three enchanted books that transport you to the worlds written on their pages when using a special amulet for each. He has just learned that there is a fourth book hidden within one of these worlds! He asks for your help in search of this fourth book. You must go to his house and use the clues and puzzles he has placed to locate and explo..."
570,Fruits Mania : Elly’s travel,"The match-3 puzzle game that’ll make you go bananas! \nFruits Mania : Elly’s travel is a highly addictive and delicious match-3 puzzle game!\nJoin Elly the elephant as she collects all kinds of fruits for her little brother and sister.\n\nHOW TO PLAY\n• Swipe and match 3 or more fruits.\n• Collect the fruits to complete each level!\n• Use colorful and powerful boosters to help you out!\n• Enjoy other various missions, like defeating the crocodile and catching the mouse!\n• Achieve 3 stars to..."
129,Basketball Battle,It's arcade style basketball battling fun! \n\n** NEW SPOOKY HALLOWEEN EVENTS **\n\nYOUR MOVES:\n★ Easy moving and shooting!\n★ Dunk on people!\n★ Nasty cross-overs and step-backs!\n\nYOUR COMPETITION:\n★ Travel the country competing in tournaments!\n★ Over 100 unique basketball courts!\n★ Compete in online LIVE EVENTS!\n\nYOUR TEAM:\n★ Customize your look and upgrade your players!\n\nYOUR FRIENDS:\n★ Challenge friends in 2 player split screen!\n\n\nDon't miss out on the sports game everyone...
633,Gunspell - Match 3 Puzzle RPG,"It is a story-driven rpg-adventure where guns and magic act together! \nBecome a member of this powerful Order whose mission it is to protect Earth. Travel through other worlds, contend with monsters, upgrade your weapons and enhance your magic. Have fun!\nIt is a completely free game with an option to purchase packs of in-game currency or a premium account.\n\n• Match 3 battles with a lot of features\n• Multiple strange new worlds to explore\n• Hordes of enemies to fight\n• Tons of differen..."


## Problem 3: Custom preprocessing & tokenization

In Problem 1, you should have seen that `TfidfVectorizer` already performs some preprocessing by default and also does its own tokenization of the input data. This is great for getting started, but often we want to have more control over these steps. We can customize some aspects of the preprocessing through arguments when instantiating `TfidfVectorizer`, but for this exercise, we want to do _all_ of our preprocessing & tokenizing outside of scikit-learn.

Concretely, we want to use [spaCy](https://spacy.io), a library that we will make use of in later labs as well.  Here is a brief example of how to load and use a spaCy model:

In [70]:
import spacy
# Load the small English model, disabling some components that we don't need right now
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner', 'textcat'])

# Take an example sentence and print every token from it separately
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


**Your task** is to write a preprocessing function that uses spaCy to perform the following steps:
- tokenization
- lemmatization
- stop word removal
- removing tokens containing non-alphabetical characters

We recommend that you go through the [Linguistic annotations](https://spacy.io/usage/spacy-101#annotations) section of the spaCy 101, which demonstrates how you can get the relevant kind of information via the spaCy library.
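As a starting point, the sketch below prints two of the token attributes that are relevant for the steps listed above. It assumes a blank English pipeline (`spacy.blank("en")`), which only provides the tokenizer and the stop-word list, so that it runs without a trained model; lemmas (`token.lemma_`) additionally require a trained pipeline such as the `en_core_web_sm` model loaded earlier. The names `blank_nlp` and `kept` are made up for this example.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer and stop-word list only),
# so that the sketch runs without downloading a trained model.
blank_nlp = spacy.blank("en")
doc = blank_nlp("Apple is looking at buying U.K. startup for $1 billion")

# Inspect the per-token flags used for filtering
for token in doc:
    print(f"{token.text:<10} is_alpha={token.is_alpha!s:<6} is_stop={token.is_stop}")

# Keep only purely alphabetical tokens that are not stop words
kept = [token.text for token in doc if token.is_alpha and not token.is_stop]
print(kept)
```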

Implement your preprocessor by completing the following function:

In [None]:
def preprocess(text):
    """Preprocess the given text by tokenising it, removing any stop words, 
    replacing each remaining token with its lemma (base form), and discarding 
    all lemmas that contain non-alphabetical characters.

    Arguments:
      text (str): The text to preprocess.

    Returns:
      The list of remaining lemmas after preprocessing (represented as strings).
    """
    # YOUR CODE HERE
    doc = nlp(text)
    # Keep the lemma of every token that is purely alphabetical and not a stop word
    return [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]

print(preprocess("Apple is looking at buying U.K. startup for $1 billion"))

['Apple', 'look', 'buy', 'startup', 'billion']


### 🤞 Test your code

Test your implementation by running the following cell:

In [78]:
"""Check that the preprocessing returns the correct output for a number of test cases."""

assert (
    preprocess('Apple is looking at buying U.K. startup for $1 billion') ==
    ['Apple', 'look', 'buy', 'startup', 'billion']
)
assert (
    preprocess('"Love Story" is a country pop song written and sung by Taylor Swift.') ==
    ['Love', 'Story', 'country', 'pop', 'song', 'write', 'sing', 'Taylor', 'Swift']
)
success()

## Problem 4: The effect of preprocessing

To make use of the new `preprocess` function from Problem 3, we need to make sure that we incorporate it into `TfidfVectorizer` and disable all preprocessing & tokenization that `TfidfVectorizer` performs by default. Afterwards, we also need to re-fit the vectorizer and the nearest-neighbors class. To make this a bit easier to handle, let’s take everything we have done so far and put it in a single class `AppSearcher`.

### Task 4.1

**Your first task** is to complete the stub of the `AppSearcher` class given below. Keep in mind:
- The `fit()` function should fit both the vectorizer (from Problem 1) and the nearest-neighbors class (from Problem 2).  Make sure to modify the call to `TfidfVectorizer` to _disable all preprocessing & tokenization_ that it would do by default, and replace it with a call to the `preprocess()` function _defined in `AppSearcher`_.
- For the `preprocess()` function, you can start by copying your solution from Problem 3.
- For the `search()` function, you can copy your solution from Problem 2.
- Make sure to adapt your code to store everything (data, vectorizer, nearest-neighbors class) within the `AppSearcher` class, so that your solution is independent of the code you wrote above!
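One way to disable the built-in preprocessing and tokenization is to pass a callable as the `analyzer` argument: scikit-learn then uses whatever list of tokens the callable returns, without lowercasing or re-tokenizing. The sketch below contrasts this with the default behaviour on two made-up strings; `whitespace_analyzer` is a hypothetical stand-in for a real preprocessing function.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def whitespace_analyzer(text):
    """Hypothetical stand-in for a custom preprocessing function."""
    return text.split()

docs = ["Pile up Pancakes", "Break the Bricks"]

default_vec = TfidfVectorizer().fit(docs)
custom_vec = TfidfVectorizer(analyzer=whitespace_analyzer).fit(docs)

print(sorted(default_vec.vocabulary_))  # built-in lowercasing and tokenization
print(sorted(custom_vec.vocabulary_))   # exactly what the callable returned
```

An alternative with a similar effect would be to pass the function as `tokenizer` and set `lowercase=False`; the callable `analyzer` is simply the most direct way to take over the whole pipeline.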

In [83]:
class AppSearcher:
    def fit(self, df):
        """Instantiate and fit all the classes required for the search engine (cf. Problems 1 and 2)."""
        # YOUR CODE HERE
        self.df = df
        # A callable analyzer replaces all of TfidfVectorizer's own preprocessing and tokenization
        self.vectorizer = TfidfVectorizer(analyzer=self.preprocess)
        self.nn = NearestNeighbors(n_neighbors=10)

        X = self.vectorizer.fit_transform(df['description'])
        self.nn.fit(X)

    def preprocess(self, text):
        """Preprocess the given text (cf. Problem 3)."""
        # YOUR CODE HERE
        doc = nlp(text)
       
        res = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
        return res
        

    def search(self, query):
        """Find the nearest neighbors in `df` for a query string (cf. Problem 2)."""
        # YOUR CODE HERE
        v_q = self.vectorizer.transform([query])
        ns = self.nn.kneighbors(v_q, return_distance=False)[0]
        return self.df.iloc[np.array(ns)]


#### 🤞 Test your code

The following cell demonstrates how your class should be used. Note that it can take a bit longer to fit than before, since we’re now calling spaCy for the preprocessing.

In [84]:
apps = AppSearcher()
apps.fit(df)
apps.search("pile up pancakes")

Unnamed: 0,name,description
1032,Pancake Tower,Let's see how many pancakes you can pile up!!
326,Cooking School: Games for Girls,"Children like to help their parents. They especially like to help with cooking . When there is a cooking in the kitchen, it is no way to play. But cooking is a complicated process and often it ends up with a huge mess in the kitchen. But what if you are so eager to cook pancakes, cake or cupcakes? How to cook all that without doing a cleaning after? We have a solution! Home Cooking School with our curious Hippo has opened especially for parents and children! We do not only cook food here. We..."
1235,Solitaire,"Solitaire Free by Solitaire Card Games is the #1 klondike solitaire games on android. The solitaire Free is popular and classic card games you know and love.\n\nWe carefully designed a fresh solitaire free modern look, woven into the wonderful solitaire classic feel that everyone loves. \n\nExperience the crisp, clear, and easy to read cards, simple and quick animations, and subtle sounds, in either landscape or portrait views. \n\nYou can move cards with a single tap or drag them to their d..."
1181,Sago Mini Trucks and Diggers,"Drive a dump truck with Rosie the hamster! Pile dirt high and dig deep in the ground with diggers, cranes and bulldozers. Build a home for a new friend! Choose a barn, a castle or even a cupcake-house. Don’t forget to add the finishing touches for the proud owner.\n\nOn this construction site, kids love being the boss. With six mighty machines and piles of dirt, you can build all day! Part of the award-winning suite of Sago Mini apps, this app puts kids in charge.\n\nSago Mini apps have no i..."
1263,Spider Solitaire,"Spider Solitaire was built to offer card players a fun way to play their favorite classic in both portrait and landscape mode.\n\nWith large cards and a unique stacking system our Spider card game doesn't have problems fitting your screen like many others do. \n\n* How to play *\n\nTo win a game of spider solitaire, all cards must be removed from the table. Assembling the cards in the tableau allows for cards to be placed in their respective stacks in order. At the beginning of each game, 54..."
656,"Hell’s Cooking — crazy chef burger, kitchen fever","⭐ ⭐ ⭐ ⭐ ⭐ New world of crazy cooking is here. Feel what it means to be a master chef who prepares fantastic fast food in a prominent king kitchen! If you haven't ever tried yourself as a hamburger chef cook, it's possibly the best time for making diner. Download and launch Hell's Cooking — crazy chef burger, kitchen fever HD game and get prepared to jump into a fever and adventurous perfect world of burgers.\n\nNew girls game Hell's Cooking gives you lots of opportunities for your crazy cafe..."
1164,Rummy - Free,"Play the famous Rummy card game on your Android Smartphone or Tablet !! \n\nPlay rummy with 2, 3, or 4 players against simulated opponents playing with high-level artificial intelligence. \nThere are a number of rules that can be modified, making this game very faithful to the original. \n\n*** MANY VARIATIONS INCLUDED *** \n\nMany rummy variations are included in the application: \n\n- From 2 to 4 players. \n- Choose the AI level of opponents. \n- Number of cards dealt to each player (from ..."
436,Dr. Panda's Ice Cream Truck,"Chocolate? Vanilla? Strawberry? All three!? You decide! In Dr. Panda’s Ice Cream Truck you can mix up all sorts of different flavors with cookies, chocolate, nuts and more to make the perfect ice cream—hundreds of combinations in all.\n\nScoop it!\nThese animals love ice cream, and will eat as much (or little) as you want to serve them. You can make scoops big or small and pile them as high as you want—using any of the ice cream you’ve created!\n\nToppings galore!\nUse chocolate syrup, cooki..."
1245,Solitaire Free,"Solitaire by Gemego is the card game you know and love for your phone and tablet. Our Solitaire is beautifully designed with a simple interface to help you enjoy this classic game. \n\nOur Solitaire has the best card movement on the market. You don't need to select a specific card in a pile unlike other Solitaire games. \n\nFeatures\n★ Instructions - an overview of the rules of Solitaire\n★ Winning deals (random) - unlike any other Solitaire! \n★ One Card, Three Card and Vegas style games\n★..."
1442,Turbo Dismount™,"The legendary crash simulator is now on Google Play!\n\nPerform death-defying motor stunts, crash into walls, create traffic pile-ups of epic scale - and share the fun!\n\nTurbo Dismount™ is a kinetic tragedy about Mr. Dismount and the cars who love him. It is the official sequel to the wildly popular and immensely successful personal impact simulator - Stair Dismount™. \n\nFEATURES:\n* Flinch-inducing crash physics\n* Crunchy sound effects\n* Delicious slow-mo replay system\n* Multiple vehi..."


### Task 4.2

**Your second task** is to experiment with the effect of using (or not using) different preprocessing steps.  We always need to _tokenize_ the text, but the other preprocessing steps are optional, and each one requires a conscious decision about whether to use it.  Examples of such optional steps are:
- lemmatization
- lowercasing all characters
- removing stop words
- removing tokens containing non-alphabetical characters

**Modify the definition of the `preprocess()` function** of `AppSearcher` to include/exclude individual preprocessing steps, run some searches, and observe if and how the results change.  Which search queries you try out is up to you: you could compare searching for "pile up pancakes" with "pancake piling", for example; or you could try entirely different search queries aimed at different kinds of apps.  (You can modify the class directly by changing the cell above under Task 4.1, or copy the definitions to the cells below, whichever you prefer; there is no separate code to show for this task, but you will use your observations here for the individual reflection.)
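
To make the idea of switching steps on and off concrete, here is a minimal sketch in plain Python. It is not the lab's spaCy-based `preprocess()`: the function name `preprocess_sketch`, the tiny `STOP_WORDS` set, and the whitespace-based tokenization are all assumptions made for illustration. Lemmatization is left out because it needs a trained language model.

```python
# Tiny illustrative stop word list (an assumption; spaCy's list is much larger).
STOP_WORDS = {"a", "an", "and", "the", "of", "up", "you", "can", "how", "many"}

def preprocess_sketch(text, lowercase=True, remove_stop_words=True, alpha_only=True):
    """Tokenize `text` and apply only the selected optional steps."""
    # Rough stand-in for a real tokenizer: split on whitespace and
    # strip punctuation from both ends of each piece.
    tokens = [t.strip(".,!?\"'()") for t in text.split()]
    tokens = [t for t in tokens if t]
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if alpha_only:
        # Drops tokens such as "let's" that contain non-alphabetic characters.
        tokens = [t for t in tokens if t.isalpha()]
    return tokens

text = "Let's see how many pancakes you can pile up!!"
all_steps = preprocess_sketch(text)
no_steps = preprocess_sketch(
    text, lowercase=False, remove_stop_words=False, alpha_only=False
)
print(all_steps)
print(no_steps)
```

Because this sketch does not lemmatize, the token `pancakes` would not match a query token `pancake`; this is the kind of difference the task asks you to observe.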

In [103]:
class AppSearcher2:
    def fit(self, df):
        """Instantiate and fit all the classes required for the search engine (cf. Problems 1 and 2)."""
        # TfidfVectorizer, NearestNeighbors and nlp are assumed to be
        # imported/defined in earlier cells of this notebook.
        self.df = df
        # Passing a callable as `analyzer` makes the vectorizer use our own
        # tokenization and filtering instead of its built-in one.
        self.vectorizer = TfidfVectorizer(analyzer=self.preprocess)
        self.nn = NearestNeighbors(n_neighbors=10)

        X = self.vectorizer.fit_transform(df['description'])
        self.nn.fit(X)

    def preprocess(self, text):
        """Preprocess the given text (cf. Problem 3)."""
        # Variant for Task 4.2: keep the surface form (no lemmatization, no
        # lowercasing), but drop stop words and non-alphabetic tokens.
        doc = nlp(text)
        return [token.text for token in doc if token.is_alpha and not token.is_stop]

    def search(self, query):
        """Find the nearest neighbors in `df` for a query string (cf. Problem 2)."""
        v_q = self.vectorizer.transform([query])
        # kneighbors returns one row of indices per query; there is one query.
        ns = self.nn.kneighbors(v_q, return_distance=False)[0]
        return self.df.iloc[ns]

app2 = AppSearcher2()
app2.fit(df)
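
A note on the ranking used by this class: by default, `NearestNeighbors` measures Euclidean distance, and `TfidfVectorizer` scales every vector to unit L2 length. For unit-length vectors, the squared Euclidean distance equals 2 minus twice the cosine similarity, so sorting by distance yields the same order as sorting by cosine similarity. The sketch below checks this on made-up toy vectors in plain Python; the vectors and helper names are illustrative assumptions, not part of the lab.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sq_euclid(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Toy "tf-idf" vectors, invented for this example.
query = normalize([1.0, 2.0, 0.0])
docs = [normalize(d) for d in ([1.0, 2.0, 0.5], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0])]

# Closest first by distance, most similar first by cosine.
by_distance = sorted(range(len(docs)), key=lambda i: sq_euclid(query, docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))
print(by_distance, by_cosine)
```

Both lists contain the same document order, which is why the search results here can be read as a cosine-similarity ranking.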



In [104]:
print(apps.search("pancakes piling").iloc[0:10]['description'])


1032                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Let's see how many pancakes you can pile up!!
326     Children like to help their parents. They especially like to help with cooking . When there is a cooking in the kitchen, it is no way to play. But cooking is a complicated process and often it ends up with a huge mess in the kitchen. But what if you are so eager to cook pancakes, cake or cupcakes? How to cook all that without doing a cleaning after? We have a solution! Home Cooking School with our curious Hippo has opened especially for parents and children! We do not only cook f

In [105]:

print(app2.search("pancakes piling").iloc[0:10]['description'])

1032                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Let's see how many pancakes you can pile up!!
326     Children like to help their parents. They especially like to help with cooking . When there is a cooking in the kitchen, it is no way to play. But cooking is a complicated process and often it ends up with a huge mess in the kitchen. But what if you are so eager to cook pancakes, cake or cupcakes? How to cook all that without doing a cleaning after? We have a solution! Home Cooking School with our curious Hippo has opened especially for parents and children! We do not only cook f

In [106]:
print(apps.search("cooking").iloc[0:10]['description'])


825     Search for "BabyBus" for even more free panda games for you to try! \n\nEver wanted to have a taste of Chinese food? Now you can! BabyBus kids apps bring you a kitchen cooking game,that is, Panda Chef, Let's cook!, where you can cook within a multiple recipes. Pick ingredients, prepare food and explore cooking in your own way. You can also decorate your dishes with different toppings or condiments. Dress up as mini chefs, and clean up the messy kitchen! Don’t forget your chef’s hat!\n\nCooki...
326     Children like to help their parents. They especially like to help with cooking . When there is a cooking in the kitchen, it is no way to play. But cooking is a complicated process and often it ends up with a huge mess in the kitchen. But what if you are so eager to cook pancakes, cake or cupcakes? How to cook all that without doing a cleaning after? We have a solution! Home Cooking School with our curious Hippo has opened especially for parents and children! We do not only co

In [107]:
print(app2.search("cooking").iloc[0:10]['description'])


324     Cooking Joy 2 - a new highly addictive cooking game from the team that brought you Cooking Joy, is calling all master chef candidates! Upgraded from Cooking Joy - a fun cooking game, it inherits the same challenging spirit and adds more fun! \n\nIf you have always dreamt of becoming a top chef in a crazy cooking game world, then catch the cooking crazy fever with this game! Download NOW and try it for FREE!\n\nTime to get back to the kitchen and enjoy cooking delicious dishes for starving cu...
323     Welcome to Cooking Hot👩‍🍳👨‍🍳, the BEST chef game, FAST PACE Cooking Game and Restaurant Game! It brings you the Purest and Truest COOKING experiences. Come to different cooking cities, feel the craze and madness of a cooking games lover. Cook and Serve food as a crazy chef! \n\nIn this chef game, you act as a top chef and star chef. The craze for the cooking grabs you, and you are cooking in brink of madness. You work in the kitchens of different countries and coo

In [108]:
print(apps.search("re-opening").iloc[0:10]['description'])


687     Imagine unlimited possibilities in your smart home. Set scenes and fast effects to your mood.\n\nExperience Dance Sensation in your entertainment area with Philips Hue Entertainment. Enjoy a more colorful ambiance on your IKEA TRADFRI gateway.\n\nFeel more in control with schedules and automation from sunrise to sunset. Widgets, Shortcuts, Quick settings tiles, and Wear OS help you get more out of your smart lights.\n\nControl multiple bridges simultaneously without switching between them.\n...
430     Dr. Panda Restaurant is re-opening, and this time all the choices are yours! Make the pizza of your dreams, a pasta dish to rave about, or a soup so spicy your customers will breathe fire! Sweet or salty? Spicy or bitter? It’s up to you!\n\nKids can take charge in their own kitchen in Dr. Panda Restaurant 2! Future fine chefs have the freedom to choose what they want to prepare and exactly how they’d like to prepare it! Chop, grate, blend, fry and more with over 20 ingredient

In [109]:
print(app2.search("re-opening").iloc[0:10]['description'])


687     Imagine unlimited possibilities in your smart home. Set scenes and fast effects to your mood.\n\nExperience Dance Sensation in your entertainment area with Philips Hue Entertainment. Enjoy a more colorful ambiance on your IKEA TRADFRI gateway.\n\nFeel more in control with schedules and automation from sunrise to sunset. Widgets, Shortcuts, Quick settings tiles, and Wear OS help you get more out of your smart lights.\n\nControl multiple bridges simultaneously without switching between them.\n...
957     Once upon a time, there was wannabe baker whose dream came true! Enjoy this mouth-watering cake bakery story and help Lizzie fulfill her dream of someday opening up a sweet bakery of her own. Now that day has finally arrived! She's graduating from college, and she's more than ready to get baking some tasty cupcakes. But she needs your help! Have fun opening up lots of yummy bakeries, baking with Lizzie and becoming a true baking professional who’s famous for beautiful, tasty de

In [97]:
str(df.loc[656]['description'])

"⭐ ⭐ ⭐ ⭐ ⭐ New world of crazy cooking is here. Feel what it means to be a master chef who prepares fantastic fast food in a prominent king kitchen! If you haven't ever tried yourself as a hamburger chef cook, it's possibly the best time for making diner. Download and launch Hell's Cooking — crazy chef burger, kitchen fever HD game and get prepared to jump into a fever and adventurous perfect world of burgers.\n\nNew girls game Hell's Cooking gives you lots of opportunities for your crazy cafe puzzle! Here you own a restaurant and rush serving numerous happy frenzy visitors from the street every day. Queen, you need to not only make perfect burger fast, tasty meals and bake delicious pancakes or cakes in chef role in cafe, but also serve hungry diner people in a dash. Stand and use everything at hand, as the whole kitchenette is at your full disposal. Here you will find whatever you want, as basic foodstuff for chef, so unique favors to make baking a fun! If your visitor ord

## Individual reflection

<div class="alert alert-info">
    <strong>After you have solved the lab,</strong> write a <em>brief</em> reflection (max. one A4 page) on the question(s) below.  Remember:
    <ul>
        <li>You are encouraged to discuss this part with your lab partner, but you should each write up your reflection <strong>individually</strong>.</li>
        <li><strong>Do not put your answers in the notebook</strong>; upload them in the separate submission opportunity for the reflections on Lisam.</li>
    </ul>
</div>

1. In Problem 1, which token had the highest tf–idf score, and which the lowest?  Based on your knowledge of how tf–idf works, how would you explain this result?
2. Based on your observations in Problem 4, which preprocessing steps do you think are the most appropriate for this "search engine" example?  Why?

**Congratulations on finishing this lab! 👍**

<div class="alert alert-info">
    
➡️ Before you submit, **make sure the notebook can be run from start to finish** without errors.  For this, _restart the kernel_ and _run all cells_ from top to bottom. In Jupyter Notebook version 7 or higher, you can do this via "Run$\rightarrow$Restart Kernel and Run All Cells..." in the menu (or the "⏩" button in the toolbar).

</div>