# Big Data Modeling and Management Assignment (Question 9 Revision)

**Group members:** <br>
Lucas Correa (m20211006)<br>
Vera Canhoto (m20210659)<br>
Bruna Duarte (m20210669)<br>
Doyun Shin (m20200565)

## 🍺 The Beer project  🍺 

As shown in class, graph databases are a natural way of navigating distinct types of data. For this first project we will use a graph database to analyse beers and breweries!   

_For reference, the dataset used for this project was extracted from [kaggle](https://www.kaggle.com/ehallmar/beers-breweries-and-beer-reviews), released by Evan Hallmark. Even though the author does not provide metadata on the origin of the data, it is probably a collection of open data from sites like [beeradvocate](https://www.beeradvocate.com/)._ 

#### Problem description

Explore the database via the Python Neo4j connector and/or the graphical tool on the Neo4j webpage. Answer the questions. Submit the results by following the instructions.


#### Questions


0. __Example Question__ _How many beers does the database contain?_
1. How many different countries exist in the database?
1. Most reviews:  
    1. Which `Beer` has the most reviews?  
    1. Which `Brewery` has the most reviews for its beers?
    1. Which `Country` has the most reviews for its beers? 
1. Find the user/users that have the most shared reviews (reviews of the same beers) with the user CTJman?
1. Which Portuguese brewery has the most beers?
1. From those beers (the ones returned from the previous question), which has the most reviews?
1. On average how many different beer styles does each brewery produce?
1. Which brewery produces the strongest beers according to ABV?
1. If I typically enjoy a beer due to its aroma and appearance, which beer style should I try?
1. Using Graph Algorithms answer **two** of the following questions:
    1. Which two countries are most similar when it comes to their **top 10** most produced Beer styles? 
    2. <span style="color:red">Which beer has the most similar reviews to the beer `Super Bock Stout`? </span>
    3. <span style="color:red">Which user is the most influential when it comes to reviews made? </span>
    4. Which beer styles are more central when it comes to the number of beers? 
    5. <span style="color:green">Which beer is the most influential when considering beers are connected by users who review them? </span>
    6. <span style="color:green">Users are connected together by their reviews of beers. Taking into consideration the "overall" score they give as a weight, how many communities are formed from these relationships? How many users does the biggest community have? </span> 
    
Notes: 
- We've added some more questions in <span style="color:green">green</span>, so you have a broader choice.
- Questions in <span style="color:red">red</span> have an added difficulty, which will be considered while grading if chosen.
- Consider creating nodes for the STYLES and USERS. 
- For an example on how to perform such CRUD operations, please use the "load HW1 DB.ipynb" jupyter notebook.
- In case of a tie for the top entity, in terms of metrics output by the algorithms, **simply output the first.**

10. If you had to pick 3 beers to recommend using only this database, which would you pick and why?


 Questions 8 to 10 are somewhat open, which means we'll also be evaluating the reasoning behind your answer. So there aren't necessarily bad results; there are only wrong criteria, explanations or execution. 
 
#### Groups  

Groups should have 4 to 5 people.
You should register your group on **moodle**.

#### Submission      

The code used to produce the results and the respective explanations should be uploaded to moodle. They should have a clear reference to the group, either in the file name or in the document itself. Preferably one Jupyter notebook per group.

Delivery date: Until the **midnight of May 16**

#### Evaluation   

This will be 20% of the final grade.   
Each solution will be evaluated on 2 components: correctness of results and simplicity of the solution.  
All code will go through plagiarism automated checks. Groups with the same code will undergo investigation.

### Loading the Database

#### Be sure that you **don't have** the neo4j docker container from the classes running (you can Stop it in the desktop app or with the command "docker stop testneo4j")
#### Let's create a new container for the Homework (notice how it has a different name):
    docker run --name HW1neo4j -p7474:7474 -p7687:7687 -d --env NEO4J_AUTH=neo4j/test --env NEO4J_dbms_connector_https_advertised_address="localhost:7473" --env NEO4J_dbms_connector_http_advertisedaddress="localhost:7474" --env NEO4J_dbms_connector_bolt_advertised_address="localhost:7687" neo4j
    
#### The default container does not have any data whatsoever, we will have to load a database into our docker image:
- Download and unzip the "HW1 Database" file provided in Moodle.
- Copy the path of the "Data" folder of the unzipped file, e.g. "C:\Users\nunoa\Desktop\data".
- In your command line (terminal in macOS) paste the code: "docker cp C:\Users\nunoa\Desktop\data HW1neo4j:/". As you might have noticed, you do not have a user called "nunoa", so please use the appropriate path that you got from the previous step. Additionally, if your container has a different name than "HW1neo4j", please change it as well.
- Now let's restart our docker container, either in the User Interface (Docker Desktop) or in the command line by typing the command "docker restart HW1neo4j".
- Since Neo4j is trying to recognize a new database folder, this might take a bit (let's say 3 minutes), so don't worry.

<span style="color:red">**PLEASE NOTE:**</span> We did not follow the database loading instructions above. Instead, we loaded the entire database with the CSV-based method in "load HW1 DB.ipynb".

In [1]:
import py2neo
from pprint import pprint
username="neo4j"
password="test"
host="localhost"
port="7474"

secure_graph = py2neo.Graph(f"http://{username}:{password}@{host}:{port}")

secure_graph.run("MATCH () RETURN count(*)").data()

[{'count(*)': 9659261}]

#### Understanding the Database

In [2]:
result = secure_graph.run("""
        call db.labels();
""").data()
pprint(result)

[{'label': 'COUNTRIES'},
 {'label': 'CITIES'},
 {'label': 'BREWERIES'},
 {'label': 'BEERS'},
 {'label': 'REVIEWS'},
 {'label': 'STYLE'},
 {'label': 'USER'}]


In [3]:
result = secure_graph.run("""
        CALL db.relationshipTypes();
""").data()
pprint(result)

[{'relationshipType': 'REVIEWED'},
 {'relationshipType': 'BREWED'},
 {'relationshipType': 'IN'},
 {'relationshipType': 'HAS_STYLE'},
 {'relationshipType': 'POSTED'}]


#### 0. How many different countries exist in the database?

(Excluded in the newly released notebook and based on an email communication with the Lab Instructor)

In [4]:
# How many nodes of Countries label are there?

result = secure_graph.run("""
        MATCH (c:COUNTRIES)
        RETURN count(distinct c) AS Number_of_Countries
""").data()
print(result)

[{'Number_of_Countries': 200}]


#### 1. Most reviews:  
    A) Which `Beer` has the most reviews?  
    B) Which `Brewery` has the most reviews for its beers?
    C) Which `Country` has the most reviews for its beers? 

<span style="color:red">**PLEASE NOTE:**</span> We overwrite the variable "result" for all our answers to save memory, just in case, as we don't need to reuse these results.

In [5]:
# A) Which `Beer` has the most reviews?  
result=secure_graph.run("""
        MATCH (be:BEERS)-[r]->(:REVIEWS)
        RETURN be.name AS Beer, count(r) AS Review_Count
        ORDER BY count(r) DESC
        LIMIT 1
""").data()
print(result)

[{'Beer': 'IPA', 'Review_Count': 31387}]
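
One caveat worth checking: the query above groups by `be.name`, so distinct beers that share a name (a generic name such as 'IPA' is a likely candidate) are pooled into a single row. The sketch below uses made-up records, not data from the database, to show how grouping by name and grouping by a unique identifier can give different winners. The `beer_id` field is hypothetical.

```python
from collections import Counter

# Hypothetical (beer_id, beer_name) pairs, one per review; not real data.
reviews = [
    (1, "IPA"), (1, "IPA"),
    (2, "IPA"), (2, "IPA"),
    (3, "Pale Ale"), (3, "Pale Ale"), (3, "Pale Ale"),
]

# Grouping by name pools beers 1 and 2 together.
by_name = Counter(name for _, name in reviews)
# Grouping by (id, name) keeps each beer separate.
by_beer = Counter(reviews)

top_by_name = by_name.most_common(1)[0]
top_by_beer = by_beer.most_common(1)[0]
print(top_by_name)
print(top_by_beer)
```

In Cypher, a comparable check would be to aggregate on the node itself (for example `WITH be, count(r) AS n`) and only then return `be.name`.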


In [6]:
# B) Which `Brewery` has the most reviews for its beers?
result = secure_graph.run("""
        MATCH (br:BREWERIES)-[]->(:BEERS)-[r]->(:REVIEWS)
        RETURN br.name AS Brewery, count(r) AS Review_Count
        ORDER BY count(r) DESC
        LIMIT 1
""").data()
print(result)

[{'Brewery': 'Sierra Nevada Brewing Co.', 'Review_Count': 175161}]


In [7]:
# C) Which `Country` has the most reviews for its beers? 
result = secure_graph.run("""
        MATCH (c:COUNTRIES)<-[]-(:CITIES)<-[]-(:BREWERIES)-[]->(:BEERS)-[r]->(:REVIEWS)        
        RETURN c.name AS Country, count(r) AS Review_Count
        ORDER BY count(r) DESC
        LIMIT 1
""").data()
print(result)

[{'Country': 'US', 'Review_Count': 7675823}]


#### 2. Find the user/users that have the most shared reviews (reviews of the same beers) with the user CTJman?


In [8]:
# Count distinct beers rather than reviews, since the question is about beers that both users have reviewed.
result = secure_graph.run("""
        MATCH (u:USER)<-[:POSTED]-(r1:REVIEWS)<-[:REVIEWED]-(be:BEERS)-[:REVIEWED]->(r2:REVIEWS)-[:POSTED]->(u2:USER)
        WHERE u.name ="CTJman"
        RETURN u2.name AS User, COUNT(DISTINCT be) AS Shared_Beer_Reviews 
        ORDER BY Shared_Beer_Reviews DESC 
        LIMIT 1
""").data()
print(result)

[{'User': 'acurtis', 'Shared_Beer_Reviews': 1428}]
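
The counting rule used here (distinct shared beers, not pairs of reviews) can be sketched with plain Python sets. The users and beers below are invented for illustration, and the helper names are hypothetical.

```python
# Hypothetical user -> reviewed beers; lists so that repeat reviews are visible.
reviews_by_user = {
    "target": ["A", "B", "B", "C"],
    "u1": ["A", "B", "B", "B"],
    "u2": ["A", "B", "C"],
}

def shared_beer_count(user, other):
    """Number of distinct beers that both users have reviewed."""
    return len(set(reviews_by_user[user]) & set(reviews_by_user[other]))

def review_pair_count(user, other):
    """Number of (review, review) pairs on the same beer; inflated by repeats."""
    return sum(
        1
        for a in reviews_by_user[user]
        for b in reviews_by_user[other]
        if a == b
    )

others = [u for u in reviews_by_user if u != "target"]
by_shared_beers = max(others, key=lambda u: shared_beer_count("target", u))
by_review_pairs = max(others, key=lambda u: review_pair_count("target", u))
print(by_shared_beers, by_review_pairs)
```

In this toy data the two rules disagree, which is why the query uses `COUNT(DISTINCT be)` rather than counting matched paths.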


#### 3. Which Portuguese brewery has the most beers?


In [9]:
result = secure_graph.run("""
        MATCH (c:COUNTRIES)<-[]-(:CITIES)<-[]-(br:BREWERIES)-[r]->(:BEERS)
        WHERE c.name = 'PT'
        RETURN br.name AS Brewery, count(r) AS Number_of_Beers
        ORDER BY Number_of_Beers DESC
        LIMIT 1
""").data()
print(result)

[{'Brewery': 'Dois Corvos Cervejeira', 'Number_of_Beers': 40}]


#### 4. From those beers (the ones produced in the brewery from the previous question), which has the most reviews?


In [10]:
result = secure_graph.run("""
        MATCH (r:REVIEWS)<-[:REVIEWED]-(be:BEERS)<-[:BREWED]-(br:BREWERIES)
        WHERE br.name = "Dois Corvos Cervejeira"
        RETURN be.name as Beer_Name, COUNT(r) as Review_Count
        ORDER BY Review_Count DESC 
        LIMIT 1
""").data()
print(result)

[{'Beer_Name': 'Finisterra', 'Review_Count': 10}]


#### 5. On average how many different beer styles does each brewery produce?


In [11]:
result = secure_graph.run("""
        MATCH (br:BREWERIES)-[:BREWED]->(:BEERS)-[:HAS_STYLE]->(s:STYLE)
        WITH br.name AS Brewery, COUNT(DISTINCT s) AS count
        RETURN round(avg(count), 1) AS Average_Beer_Style_per_Brewery
""").data()
print(result)
# We do not round the result to a whole number, since we do not know how this value will be used.

[{'Average_Beer_Style_per_Brewery': 10.7}]
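
As a sanity check on the aggregation logic, the same two-step computation (distinct styles per brewery, then the mean across breweries) can be written in plain Python. The data are invented. Note that, as with the Cypher pattern, a brewery with no styled beers never appears in the grouping and therefore does not pull the average down.

```python
from statistics import mean

# Hypothetical (brewery, style) pairs, one per beer; not real data.
beers = [
    ("Brewery A", "Stout"), ("Brewery A", "Stout"), ("Brewery A", "IPA"),
    ("Brewery B", "Lager"),
]

# Step 1: collect the distinct styles of each brewery.
styles_per_brewery = {}
for brewery, style in beers:
    styles_per_brewery.setdefault(brewery, set()).add(style)

# Step 2: average the per-brewery counts.
average_styles = mean(len(styles) for styles in styles_per_brewery.values())
print(round(average_styles, 1))
```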


#### 6. Which brewery produces the strongest beers according to ABV?


In [12]:
result = secure_graph.run("""
        MATCH (br:BREWERIES)-[b:BREWED]->(be:BEERS)
        WHERE be.abv <> 'nan'
        WITH br.name AS Brewery, round(avg(toFloat(be.abv)),2) AS Average_ABV
        RETURN Brewery, Average_ABV
        ORDER BY Average_ABV DESC
        LIMIT 1
""").data()
print(result)

[{'Brewery': '1648 Brewing Company Ltd', 'Average_ABV': 25.58}]


#### 7. If I typically enjoy a beer due to its aroma and appearance, which beer style should I try?


In [13]:
result = secure_graph.run("""
        MATCH (s:STYLE)<-[:HAS_STYLE]-(be:BEERS)-[:REVIEWED]->(r:REVIEWS)
        WHERE s.name <> 'nan' AND r.smell <> 'nan' AND r.look <> 'nan'
        RETURN s.name AS Style, round((avg(toFloat(r.smell))) + (avg(toFloat(r.look))),1) AS Aroma_Appearance_Score, count(r) AS Review_Count
        ORDER BY Aroma_Appearance_Score DESC
        LIMIT 1
""").data()
print(result)
# 1. We show the review count to check that the style has enough reviews to justify the score.
# 2. There is no need to rescale the score to a 5-point scale, as Aroma_Appearance_Score is a new metric (the sum of the two averages).
# 3. We also don't need to filter for smell or look above 4: many styles score above 4 on both,
# so a style with either metric below 4 would not be the top result.

[{'Style': 'New England IPA', 'Aroma_Appearance_Score': 8.8, 'Review_Count': 110696}]


#### 8. Using Graph Algorithms answer **two** of the following questions:
1. Which two countries are most similar when it comes to their **top 10** most produced Beer styles? 
2. <span style="color:red">Which beer has the most similar reviews to the beer `Super Bock Stout`? </span>
3. <span style="color:red">Which user is the most influential when it comes to reviews made? </span>
4. Which beer styles are more central when it comes to the number of beers? 
5. <span style="color:green">Which beer is the most influential when considering beers are connected by users who review them? </span>
6. <span style="color:green">Users are connected together by their reviews of beers. Taking into consideration the "overall" score they give as a weight, how many communities are formed from these relationships? How many users does the biggest community have? </span> 
    
Notes: 
- We've added some more questions in <span style="color:green">green</span>, so you have a broader choice.
- Questions in <span style="color:red">red</span> have an added difficulty, which will be considered while grading if chosen.
- Consider creating nodes for the STYLES and USERS. 
- For an example on how to perform such CRUD operations, please use the "load HW1 DB.ipynb" jupyter notebook.
- In case of a tie for the top entity, in terms of metrics output by the algorithms, **simply output the first.**

#### 8.2 Which beer has the most similar reviews to the beer Super Bock Stout?

<span style="color:red">**PLEASE NOTE**:</span> 
- REVIEWS has both an "overall" and a "score" property. We assume "overall" is the user's rating of how well the individual aspects (smell, taste, etc.) harmonize. For example, many users would find that a dark beer does not harmonize with a fruity taste. We assume "score" is the final rating of the beer taking everything into consideration, probably an aggregate of the other ratings with weights that we are not aware of. Hence we exclude "score" from the similarity vector, as it is already represented by its components (feel, look, smell, etc.); the query uses it only as a fallback when a component average is missing.
- Again, we overwrite the variable "result" for all our answers to save memory, just in case, as we don't need to reuse these results.

In [14]:
result = secure_graph.run("""
       CALL{ 
            MATCH (b:BEERS)-[:REVIEWED]->(r:REVIEWS)
           WHERE b.name = 'Super Bock Stout'
           RETURN 
            b.name as Super_Bock_Stout,
            avg(toFloat(r.feel)) AS feel,
            avg(toFloat(r.look)) AS look,
            avg(toFloat(r.smell)) AS smell,
            avg(toFloat(r.taste)) AS taste,
            avg(toFloat(r.overall)) AS overall
            }
       CALL{
           MATCH (r2:REVIEWS)<-[:REVIEWED]-(d:BEERS)
           WHERE d.name <> 'Super Bock Stout'
           RETURN
            d.name as Similar_Beer_Name,
            COALESCE(avg(toFloatOrNull(r2.feel))   ,avg(toFloatOrNull(r2.score))) AS feel2,
            COALESCE(avg(toFloatOrNull(r2.look))   ,avg(toFloatOrNull(r2.score))) AS look2,
            COALESCE(avg(toFloatOrNull(r2.smell))  ,avg(toFloatOrNull(r2.score))) AS smell2,
            COALESCE(avg(toFloatOrNull(r2.taste))  ,avg(toFloatOrNull(r2.score))) AS taste2,
            COALESCE(avg(toFloatOrNull(r2.overall)),avg(toFloatOrNull(r2.score))) AS overall2
            }
       RETURN Similar_Beer_Name, 
            gds.similarity.cosine([feel,look,smell,taste,overall],[feel2,look2,smell2,taste2,overall2]) AS Review_Similarity
       ORDER BY Review_Similarity DESC
       LIMIT 1
       """).data()
pprint(result, sort_dicts=False)

[{'Similar_Beer_Name': 'ROK IPA', 'Review_Similarity': 0.9999955414685279}]
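
To help interpret a similarity this close to 1, the sketch below implements the standard cosine formula in plain Python on invented rating vectors (feel, look, smell, taste, overall). It is meant as an illustration of the measure, not as a reproduction of the GDS internals. Cosine similarity compares the direction of two vectors, so a uniformly mediocre beer and a uniformly excellent one come out as near-identical.

```python
import math

def cosine_similarity(a, b):
    """Standard cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical average ratings; not taken from the database.
mediocre = [3.0, 3.0, 3.0, 3.0, 3.0]
excellent = [4.5, 4.5, 4.5, 4.5, 4.5]
uneven = [4.5, 2.0, 4.0, 3.0, 3.5]

sim_scaled = cosine_similarity(mediocre, excellent)
sim_uneven = cosine_similarity(mediocre, uneven)
print(sim_scaled, sim_uneven)
```

Since all rating vectors are positive and fall in a narrow range, most beers will score close to 1 against Super Bock Stout. A distance-based measure such as Euclidean distance would also take the rating level into account, if that is the notion of similarity wanted.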


#### 8.3 Which user is the most influential when it comes to reviews made?

In [15]:
# Step 0 - Clear graph if it already exists.
try:
    delete_graph = secure_graph.run("""
        CALL gds.graph.drop('Users_Reviews_Graph') YIELD graphName;
    """).data()

    pprint(delete_graph)
except Exception as e:
    print(e)

[{'graphName': 'Users_Reviews_Graph'}]


In [16]:
result = secure_graph.run("""
        CALL gds.graph.project(
            'Users_Reviews_Graph',
            ['USER', 'REVIEWS'],
            {POSTED:{orientation:'NATURAL'}}
        )
""").data()
pprint(result)

[{'graphName': 'Users_Reviews_Graph',
  'nodeCount': 9238063,
  'nodeProjection': {'REVIEWS': {'label': 'REVIEWS', 'properties': {}},
                     'USER': {'label': 'USER', 'properties': {}}},
  'projectMillis': 2087,
  'relationshipCount': 9073128,
  'relationshipProjection': {'POSTED': {'aggregation': 'DEFAULT',
                                        'orientation': 'NATURAL',
                                        'properties': {},
                                        'type': 'POSTED'}}}]


In [17]:
# Reference: https://neo4j.com/docs/graph-data-science/current/algorithms/page-rank/
# The POSTED relationship has no weight property, so we use the unweighted variant.
result = secure_graph.run("""
        CALL gds.pageRank.stream('Users_Reviews_Graph')
        YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).name AS Username, round(score, 2) AS Importance_Score
        ORDER BY Importance_Score DESC 
        LIMIT 1
""").data()
pprint(result, sort_dicts=False)

[{'Username': 'Sammy', 'Importance_Score': 1759.4}]
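
In this projection every `POSTED` relationship points from a review to a user, and review nodes have no incoming relationships, so rank can only flow one step. Under the textbook update quoted in the GDS PageRank documentation, PR(A) = (1 - d) + d * sum(PR(T) / C(T)), a user's score would then grow linearly with the number of reviews posted. The sketch below implements that update on a made-up graph. It is an illustration under that assumption, not a reproduction of the GDS implementation.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Iterate PR(n) = (1 - d) + d * sum(PR(s) / out_degree(s)) over sources s of n."""
    nodes = {node for edge in edges for node in edge}
    out_degree = {node: 0 for node in nodes}
    incoming = {node: [] for node in nodes}
    for source, target in edges:
        out_degree[source] += 1
        incoming[target].append(source)

    scores = {node: 1.0 - damping for node in nodes}
    for _ in range(iterations):
        scores = {
            node: (1.0 - damping)
            + damping * sum(scores[s] / out_degree[s] for s in incoming[node])
            for node in nodes
        }
    return scores

# Hypothetical review -> user edges, mirroring (REVIEWS)-[:POSTED]->(USER).
edges = [("r1", "alice"), ("r2", "alice"), ("r3", "alice"), ("r4", "bob")]
scores = pagerank(edges)
print(round(scores["alice"], 4), round(scores["bob"], 4))
```

If this reading holds for the real projection, the most influential user by PageRank is simply the user with the most reviews, which could be cross-checked with a plain `count` query.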


### 9. If you had to pick 3 beers to recommend using only this database, which would you pick and why?

**9.1.1 Excellent Portuguese Beers**

Here we define "excellent" beers as those with an average score above 4. Among these excellent Portuguese beers, we assume that those with more reviews are more reliably excellent across different tasters. The idea is that Portugal is widely known as a wine country, but we want to find and recommend its great beers.

In [18]:
excellent_pt_beers = secure_graph.run("""
     MATCH (c:COUNTRIES)<-[]-(:CITIES)<-[]-(:BREWERIES)-[]->(B:BEERS)-[R:REVIEWED]->(RE:REVIEWS)
     WHERE RE.score <> 'nan' AND c.name ='PT'
     WITH B.name as Beer_Name, round(avg(toFloat(RE.score)),1) as Avg_Score, count(RE) AS Review_Counts
     WHERE Avg_Score > 4
     Return Beer_Name, Avg_Score, Review_Counts
     ORDER BY Review_Counts DESC
     LIMIT 3
""").data()

pprint(excellent_pt_beers, sort_dicts=False)

[{'Beer_Name': 'Voragem', 'Avg_Score': 4.1, 'Review_Counts': 7},
 {'Beer_Name': 'Passarola IPA', 'Avg_Score': 4.1, 'Review_Counts': 7},
 {'Beer_Name': 'Mean Sardine / De Molen Ginja Ninja',
  'Avg_Score': 4.2,
  'Review_Counts': 4}]


**9.1.2 Popular Portuguese Beers**

We tried to find the best beers in Portugal, but the number of reviews is too small to be reliable. Hence we can also recommend the most popular beers in Portugal, regardless of quality, so that the person can experience the mainstream beers that the average Portuguese drinks frequently. Again, review counts are only a proxy: we believe this review website is not used by many Portuguese, considering the low number of reviews for Super Bock, arguably the most widely available beer in Portugal.

In [19]:
#9- If you had to pick 3 beers to recommend using only this database, which would you pick and why?

popular_pt_beers = secure_graph.run("""
     MATCH (c:COUNTRIES)<-[:IN]-(:CITIES)<-[:IN]-(:BREWERIES)-[:BREWED]->(B:BEERS)-[R:REVIEWED]->(RE:REVIEWS)
     WHERE RE.score <> 'nan' AND c.name ='PT'
     Return B.name AS Name, round(avg(toFloat(RE.score)),1) AS Avg_Score, COUNT(RE) as Review_Counts
     ORDER BY Review_Counts DESC
     LIMIT 3
""").data()

pprint(popular_pt_beers, sort_dicts=False)

[{'Name': 'Super Bock', 'Avg_Score': 2.8, 'Review_Counts': 391},
 {'Name': 'Sagres Cerveja', 'Avg_Score': 2.8, 'Review_Counts': 279},
 {'Name': 'Super Bock Stout', 'Avg_Score': 3.1, 'Review_Counts': 82}]


**9.2 "The Great Unknowns"**

We want to find beers that are barely known to the users of this database yet have an excellent score, using the number of reviews as a proxy for how well known a beer is. The range of 200 to 500 reviews was chosen subjectively, by comparison with the review counts of popular beers: 1) we believe an average score from at least 200 reviews is quite reliable, while 2) at most 500 reviews indicates that not many users have tasted the beer, as many beers have thousands if not tens of thousands of reviews.

In [20]:
great_unknowns = secure_graph.run("""
     MATCH (b:BEERS)-[:REVIEWED]->(r:REVIEWS)
     WHERE r.overall <> 'nan'
     WITH b.name as Beer_Name, round(avg(toFloat(r.score)),2) as Avg_Score, count(r) AS Review_Counts
     WHERE Avg_Score > 4 AND Review_Counts >= 200 AND Review_Counts <= 500
     Return Beer_Name, Avg_Score, Review_Counts
     ORDER BY Avg_Score DESC
     LIMIT 3
""").data()

pprint(great_unknowns, sort_dicts=False)

[{'Beer_Name': 'Kentucky Brunch Brand Stout',
  'Avg_Score': 4.84,
  'Review_Counts': 434},
 {'Beer_Name': 'Drie Fonteinen Zenne Y Frontera',
  'Avg_Score': 4.74,
  'Review_Counts': 250},
 {'Beer_Name': 'King JJJuliusss', 'Avg_Score': 4.73, 'Review_Counts': 403}]
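
The selection rule (average score above 4, between 200 and 500 reviews, ranked by average score) can be sketched in plain Python. The beers and scores below are invented, and the function name and parameters are hypothetical.

```python
from statistics import mean

# Hypothetical beer -> list of review scores; not real data.
scores_by_beer = {
    "Hidden Gem": [4.8] * 250,
    "Crowd Favourite": [4.9] * 5000,
    "Tiny Sample": [5.0] * 3,
    "Average Joe": [3.5] * 300,
}

def great_unknowns(scores_by_beer, min_reviews=200, max_reviews=500, min_avg=4.0):
    """Keep well-rated beers whose review count falls inside the window."""
    rows = [
        (name, round(mean(scores), 2), len(scores))
        for name, scores in scores_by_beer.items()
    ]
    kept = [
        row for row in rows
        if row[1] > min_avg and min_reviews <= row[2] <= max_reviews
    ]
    return sorted(kept, key=lambda row: row[1], reverse=True)

picks = great_unknowns(scores_by_beer)
print(picks)
```

The lower bound removes beers whose average rests on a handful of reviews, and the upper bound removes beers that are already widely known.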


**9.3 "The Perfectionists"**

These beers score exceptionally high (an average above 4.5) on every evaluation criterion and have a decent number of reviews (at least 200), which adds to the reliability of these scores.

In [21]:
perfect_beers = secure_graph.run("""
     MATCH (b:BEERS)-[:REVIEWED]->(r:REVIEWS)
     WHERE r.score <> 'nan' AND r.feel <> 'nan' AND r.look <> 'nan' AND 
         r.smell <> 'nan' AND r.taste <> 'nan' AND r.overall <> 'nan'
     WITH b.name as Beer_Name, 
         round(avg(toFloat(r.feel)),2) as Avg_Feel,
         round(avg(toFloat(r.look)),2) as Avg_Look,
         round(avg(toFloat(r.smell)),2) as Avg_Smell,
         round(avg(toFloat(r.taste)),2) as Avg_Taste,
         round(avg(toFloat(r.overall)),2) as Avg_Overall,
         round(avg(toFloat(r.score)),2) as Avg_Final_Score,
         count(r) AS Review_Counts
     WHERE Review_Counts >= 200 AND
         Avg_Feel > 4.5 AND
         Avg_Look > 4.5 AND
         Avg_Smell > 4.5 AND
         Avg_Taste > 4.5 AND
         Avg_Overall > 4.5
     Return Beer_Name, Avg_Final_Score, Review_Counts
     ORDER BY Avg_Final_Score DESC
     LIMIT 3
""").data()

pprint(perfect_beers, sort_dicts=False)

[{'Beer_Name': 'Kentucky Brunch Brand Stout',
  'Avg_Final_Score': 4.84,
  'Review_Counts': 434},
 {'Beer_Name': 'Good Morning', 'Avg_Final_Score': 4.79, 'Review_Counts': 918},
 {'Beer_Name': 'Marshmallow Handjee',
  'Avg_Final_Score': 4.75,
  'Review_Counts': 822}]


**Q9 Conclusion:** <br>
If we have to choose exactly 3 beers, we would recommend "The Perfectionists", as they are highly rated across all rating criteria and have enough reviews for the ratings to be reliable. They should be great recommendations for anyone. These are: Kentucky Brunch Brand Stout, Good Morning, and Marshmallow Handjee.

If the person likes to experiment, however, "The Great Unknowns" would be excellent recommendations. 

The "Excellent Portuguese Beers" and "Popular Portuguese Beers" would only be recommended to visitors to Portugal, as the former do not have enough reviews to be reliable and the latter have relatively low scores of around 3.