# Group 5: BDMM 1st Project

#### Made by:
* Catarina Candeias (m20200656@novaims.unl.pt)
* Catarina Urbano (m20200607@novaims.unl.pt)
* Margarida Pereira (m20201038@novaims.unl.pt) 
* Rita Ferreira (m20200661@novaims.unl.pt)
* Tiago Gonçalves (m20201053@novaims.unl.pt) 

# Big Data Modeling and Management Assignment


## 🍺 The Beer project  🍺 

As shown in class, graph databases are a natural way of navigating distinct types of data. For this first project we will use a graph database to analyse beer and breweries!   

_For reference, the dataset used for this project was extracted from [kaggle](https://www.kaggle.com/ehallmar/beers-breweries-and-beer-reviews), released by Evan Hallmark. Even though the author does not present metadata on the origin of the data, it is probably a collection of open data from places like [beeradvocate](https://www.beeradvocate.com/)._ 

#### Problem description

Explore the database via the Python neo4j connector and/or the graphical tool on the Neo4j webpage. Answer the questions. Submit the results by following the instructions.

#### Connection details to the neo4j database
```
Host: rhea.isegi.unl.pt:7474  
Username: neo4j  
Password: F3cfcrnvBev57KZ8mcMk78L9wHgJVZuJ 
Connect URL : bolt://rhea.isegi.unl.pt:7687
```


#### Questions


0. __Example Question__ _How many beers does the database contain?_
1. How many different countries exist in the database?
1. Most reviews:  
    1. Which `Beer` has the most reviews?  
    1. Which `Brewery` has the most reviews for its beers?
    1. Which `Country` has the most reviews for its beers? 
1. Find the user/users that have the most shared reviews (reviews of the same beers) with the user CTJman?
1. Which Portuguese brewery has the most beers?
1. From those beers (the ones returned from the previous question), which has the most reviews?
1. On average how many different beer styles does each brewery produce?
1. Which brewery produces the strongest beers according to ABV?
1. If I typically enjoy a beer due to its aroma and appearance, which beer style should I try?
1. Using Graph Algorithms answer **two** of the following questions:
    1. Which two Countries are most similar when it comes to their **top 10** most produced Beer styles?
    1. Which beer has the most similar reviews as the beer `Super Bock Stout`?
    1. Which user is the most influential when it comes to reviews made?
    1. Which beer styles are more central when it comes to the number of beers? \
    Note: In case of a tie for the top entity, in terms of metrics output by the algorithms, **simply output the first.**
1. If you had to pick 3 beers to recommend using only this database, which would you pick and why?


 Questions 8 to 10 are somewhat open, which means we'll also be evaluating the reasoning behind your answer. There aren't necessarily bad results; there are only wrong criteria, explanations or execution. 
 
##### Groups  

Groups should have 4 to 5 people  
You should register your group on moodle. An email will be going out to everyone with the credentials for the database to use when storing the results.


##### Submission      

Submit the query results to the group's Redis database (explained in the first class; credentials sent via email).  
The following format is expected:
```
    >>> redis.set("0", "358873")
```

This result should be the group's answer to question 0.

The code used to produce the results and the respective explanations should be uploaded to moodle. They should have a clear reference to the group, either in the file name or in the document itself. Preferably one Jupyter notebook per group.
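The submission format above can be sketched as a small helper that writes every answer as a string under its question number. This is only an illustration: `submit_answers` and `FakeRedis` are hypothetical names, and `FakeRedis` stands in for a real client so the sketch runs without a server.

```python
def submit_answers(client, answers):
    """Store each answer under its question number, both as strings."""
    for question, answer in answers.items():
        client.set(str(question), str(answer))


class FakeRedis:
    """Hypothetical stand-in for a Redis client (only `set` is mimicked)."""

    def __init__(self):
        self.store = {}

    def set(self, key, value):
        self.store[key] = value
        return True


client = FakeRedis()
submit_answers(client, {0: 358873})  # value taken from the example above
print(client.store)  # {'0': '358873'}
```

With the real `redis` package, a `redis.Redis(host=..., port=..., password=...)` client built from the emailed credentials would take the place of `FakeRedis`.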

Delivery date: Until the **midnight of May 2nd**

##### Evaluation   

This will be 20% of the final grade.   
Each solution will be evaluated on 2 components: correctness of results and simplicity of the solution.  
All code will go through automated plagiarism checks. Groups with the same code will undergo investigation.

**Note:**
Remember that Neo4j is a shared database, so when creating in-memory graphs please use your group's prefix.  
E.g., instead of `my-graph` as the name of your graph, please use `group0-my-graph`.

In [1]:
import py2neo
from pprint import pprint
username="neo4j"
password="F3cfcrnvBev57KZ8mcMk78L9wHgJVZuJ"
host="rhea.isegi.unl.pt"
port="7474"

secure_graph = py2neo.Graph(f"http://{username}:{password}@{host}:{port}")
secure_graph.run("MATCH () RETURN count(*)").data()

[{'count(*)': 9647598}]

### 1) How many different countries exist in the database?

In [2]:
secure_graph.run("""
                 MATCH (c:Country) 
                 RETURN count(distinct c) as Number_of_Countries 
""").data()

[{'Number_of_Countries': 200}]

### 2) Most reviews:
#### - a) Which Beer has the most reviews?

In [3]:
secure_graph.run("""
                 MATCH (r:Reviews)-[:ABOUT]-(b:Beers) 
                 return b.name as Beer, count(*) as Number_of_reviews
                 order by Number_of_reviews DESC
                 LIMIT 1
""").data()

[{'Beer': 'IPA', 'Number_of_reviews': 31387}]
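One caveat about the query above: it groups by `b.name`, so beers from different breweries that happen to share a name are counted together. In Cypher the grouping key is whatever non-aggregated expression is returned, so returning the node `b` (or `id(b)`) alongside the name would count each beer separately. A minimal sketch with made-up data (the IDs and names below are hypothetical, not taken from the database) shows how the two groupings can differ:

```python
from collections import Counter

# Hypothetical (beer_id, beer_name) pairs, one entry per review.
reviews = [
    (1, "IPA"), (1, "IPA"),                              # an "IPA" from one brewery
    (2, "IPA"), (2, "IPA"),                              # an "IPA" from another brewery
    (3, "Pale Ale"), (3, "Pale Ale"), (3, "Pale Ale"),
]

# Grouping by name merges beers 1 and 2.
by_name = Counter(name for _, name in reviews)
# Grouping by (id, name) keeps each beer separate.
by_beer = Counter(reviews)

top_by_name = by_name.most_common(1)[0]
top_by_beer = by_beer.most_common(1)[0]
print(top_by_name)  # ('IPA', 4)
print(top_by_beer)  # ((3, 'Pale Ale'), 3)
```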

#### - b) Which Brewery has the most reviews for its beers?

In [4]:
secure_graph.run("""
                 MATCH (r:Reviews)-[]-(b:Beers)-[]-(br:Breweries)
                 return br.name as Brewery,count(r) as Number_of_reviews
                 order by Number_of_reviews DESC
                 LIMIT 1
""").data()

[{'Brewery': 'Sierra Nevada Brewing Co.', 'Number_of_reviews': 175161}]

#### - c) Which Country has the most reviews for its beers?

In [5]:
secure_graph.run("""
                 MATCH (r:Reviews)-[a:ABOUT]-(b:Beers)-[]-(br:Breweries)-[]-(c:Country)
                 return c.country_digit as Country, count(a) as Number_of_reviews
                 order by Number_of_reviews DESC
                 LIMIT 1
""").data()

[{'Country': 'US', 'Number_of_reviews': 7524410}]

### 3) Find the user/users that have the most shared reviews (reviews of the same beers) with the user CTJman?

In [6]:
secure_graph.run("""
                 MATCH (u1:Username{user_name: 'CTJman'})-[]-(r1:Reviews)-[:ABOUT]-(b:Beers)-[:ABOUT]-(r2:Reviews)-[]-(u2:Username)
                 
                 where u1.user_name <> u2.user_name // preventing it from counting common reviews from the same user
                 
                 return u2.user_name as User, count(b) as Number_of_shared_reviews
                 order by Number_of_shared_reviews DESC
                 LIMIT 1
""").data()

[{'User': 'acurtis', 'Number_of_shared_reviews': 1428}]
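The pattern in the query above can be read as "for every beer CTJman reviewed, find the other users who reviewed it, then count per user". A minimal pure-Python sketch of the same idea, using made-up review pairs and assuming each user reviews a given beer at most once (`shared_review_counts` is a hypothetical helper):

```python
from collections import Counter

# Hypothetical (user, beer) pairs, one per review.
reviews = [
    ("CTJman", "beer_a"), ("CTJman", "beer_b"), ("CTJman", "beer_c"),
    ("user_x", "beer_a"), ("user_x", "beer_b"), ("user_x", "beer_z"),
    ("user_y", "beer_c"),
]


def shared_review_counts(reviews, target):
    """Count, per other user, the reviews of beers the target also reviewed."""
    target_beers = {beer for user, beer in reviews if user == target}
    return Counter(
        user for user, beer in reviews
        if user != target and beer in target_beers
    )


counts = shared_review_counts(reviews, "CTJman")
print(counts.most_common(1))  # [('user_x', 2)]
```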

### 4) Which Portuguese brewery has the most beers?

In [7]:
secure_graph.run("""
                 MATCH (b:Beers)-[:BREWED_AT]-(br:Breweries)-[:FROM]-(c:Country{country_digit:'PT'})
                 return br.name as Brewery,count(b) as Number_of_beers
                 order by Number_of_beers DESC
                 LIMIT 1
""").data()

[{'Brewery': 'Dois Corvos Cervejeira', 'Number_of_beers': 40}]

### 5) From those beers (the ones returned from the previous question), which has the most reviews?

In [8]:
#  The beers returned in the previous question are the ones from 'Dois Corvos Cervejeira':
secure_graph.run("""
                 MATCH (br:Breweries{name:'Dois Corvos Cervejeira'})-[:BREWED_AT]-(b:Beers)-[]-(r:Reviews)
                 return b.name as Beer, count(r) as Number_of_reviews
                 order by Number_of_reviews DESC
                 LIMIT 1
""").data()

[{'Beer': 'Finisterra', 'Number_of_reviews': 10}]

### 6) On average how many different beer styles does each brewery produce?

In [9]:

secure_graph.run("""
                 MATCH (br:Breweries) - [] - (b:Beers)  - [:OF_TYPE] - (s:Style)
                 with br.name as name, count(DISTINCT s) as Number_of_beer_styles 
                 
                 // DISTINCT so it doesn't count the same style more than once
                 
                 
                 return avg(Number_of_beer_styles) as Average_number_of_beer_styles
""").data()

[{'Average_number_of_beer_styles': 10.669977315921736}]
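The role of `DISTINCT` in the query above can be illustrated with a small sketch on made-up data (brewery and style names are hypothetical): a brewery with two beers of the same style contributes that style only once before the average is taken.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (brewery, style) pairs, one per beer.
beers = [
    ("Brewery A", "IPA"),
    ("Brewery A", "IPA"),    # same style again: must not be counted twice
    ("Brewery A", "Stout"),
    ("Brewery B", "Lager"),
]

styles_per_brewery = defaultdict(set)
for brewery, style in beers:
    styles_per_brewery[brewery].add(style)  # a set keeps styles distinct

average_styles = mean(len(styles) for styles in styles_per_brewery.values())
print(average_styles)  # 1.5
```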

### 7) Which brewery produces the strongest beers according to ABV?

In [10]:
secure_graph.run("""
                 MATCH (br:Breweries) - [] - (b:Beers)
                 where b.abv <> "Unknown"    // Avoiding NaN Values
                 with br.name as Brewery, avg(tofloat(b.abv)) as Average_ABV // Getting the average ABV for each Brewery
                 return Brewery, Average_ABV
                 order by Average_ABV DESC
                 limit 1
                  
""").data()

[{'Brewery': '1648 Brewing Company Ltd', 'Average_ABV': 25.57777777777778}]

### 8) If I typically enjoy a beer due to its aroma and appearance, which beer style should I try?

In [11]:
# Appearance can be evaluated by the look
# Aroma can be evaluated by the smell
# Both of these variables range from 1 to 5
# To get the best beer style according to these 2 parameters, we take the style with the highest
# average of its mean look and mean smell

secure_graph.run("""
                 MATCH (r:Reviews)-[:ABOUT]- (b:Beers)  - [:OF_TYPE] - (s:Style)
                 where r.look <> 'Unknown' and r.smell <> 'Unknown' // excluding nans
                 with s.name as Beer_style, avg(tofloat(r.look)) as Avg_look, avg(tofloat(r.smell)) as Avg_smell
                 return Beer_style, Avg_look, Avg_smell, (Avg_look+Avg_smell)/2 as Avg_of_look_and_smell
                 order by Avg_of_look_and_smell DESC
                 limit 1 
                 
""").data()

[{'Beer_style': 'New England IPA',
  'Avg_look': 4.383595613210904,
  'Avg_smell': 4.41361476476119,
  'Avg_of_look_and_smell': 4.398605188986047}]
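The scoring used above (drop reviews with an `'Unknown'` look or smell, cast the rest to floats, average each attribute per style, then average the two means) can be sketched in plain Python. The style names and ratings below are made up for illustration:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (style, look, smell) triples; ratings are strings, as in the database.
reviews = [
    ("Style A", "4.0", "4.5"),
    ("Style A", "4.5", "5.0"),
    ("Style B", "3.0", "3.5"),
    ("Style B", "Unknown", "5.0"),  # excluded: look is unknown
]

looks, smells = defaultdict(list), defaultdict(list)
for style, look, smell in reviews:
    if look == "Unknown" or smell == "Unknown":
        continue
    looks[style].append(float(look))
    smells[style].append(float(smell))

scores = {s: (mean(looks[s]) + mean(smells[s])) / 2 for s in looks}
best = max(scores, key=scores.get)
print(best, scores[best])  # Style A 4.5
```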

### 9) Using Graph Algorithms answer the questions:
#### - a) Which two Countries are most similar when it comes to their **top 10** most produced Beer styles?

In [7]:
# We first create a graph with countries linked to their top 10 most produced beer styles


secure_graph.run("""                                    
    CALL gds.graph.create.cypher(
        'test-graph-group5_9a',
        'MATCH (c:Country) return id(c) as id UNION ALL MATCH (s:Style) return id(s) as id',
        
        'MATCH (c:Country)-[]-(br:Breweries)-[]-(b:Beers)-[ot:OF_TYPE]->(s:Style)
         WITH c as Country, s as Style, count(*) as times order by times desc
         WITH Country, collect(Style)[..10] as top_10, count(distinct Style) as nr_styles 
         // counting how many distinct styles each country produces and collecting the top 10 styles
         
         WHERE nr_styles >= 10 
         // Filtering out countries that produce fewer than 10 styles, to avoid similarity values of 1.0 
         // when 2 countries each produce only 1 style (the same one for both) 
         
         
         UNWIND top_10 as c_top_10 

         
         return id(Country) as source, id(c_top_10) as target'
    )
""").data()

[{'nodeQuery': 'MATCH (c:Country) return id(c) as id UNION ALL MATCH (s:Style) return id(s) as id',
  'relationshipQuery': 'MATCH (c:Country)-[]-(br:Breweries)-[]-(b:Beers)-[ot:OF_TYPE]->(s:Style)\n         WITH c as Country, s as Style, count(*) as times order by times desc\n         WITH Country, collect(Style)[..10] as top_10, count(distinct Style) as nr_styles \n         WHERE nr_styles >= 10\n         UNWIND top_10 as c_top_10 \n\n         \n         return id(Country) as source, id(c_top_10) as target',
  'graphName': 'test-graph-group5_9a',
  'nodeCount': 313,
  'relationshipCount': 1010,
  'createMillis': 1569}]

In [13]:
#Then we calculate the similarity using nodeSimilarity and get the country digits for the 2 most similar countries

secure_graph.run("""  
    CALL gds.nodeSimilarity.stream('test-graph-group5_9a')
    YIELD node1, node2, similarity
    RETURN gds.util.asNode(node1).country_digit AS Country1, gds.util.asNode(node2).country_digit AS Country2, similarity
    ORDER BY similarity DESCENDING
    LIMIT 1
""").data()

#there were multiple pairs of countries with the same similarity value, but we're only returning 1 as requested

[{'Country1': 'US', 'Country2': 'CA', 'similarity': 0.6666666666666666}]
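As far as we know, `gds.nodeSimilarity` uses the Jaccard coefficient by default: the number of shared neighbours divided by the size of the union of both neighbour sets. Since every country in the projected graph has exactly 10 styles, a score of 0.667 is consistent with two countries sharing 8 of their top 10 styles (8 / 12). A minimal sketch with placeholder style names (hypothetical, not the actual styles):

```python
def jaccard(a, b):
    """Jaccard coefficient: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Two hypothetical top-10 lists that share 8 styles.
shared = [f"style_{i}" for i in range(8)]
country_1 = shared + ["style_x1", "style_x2"]
country_2 = shared + ["style_y1", "style_y2"]

similarity = jaccard(country_1, country_2)
print(similarity)  # 0.6666666666666666
```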

#### - d) Which beer styles are more central when it comes to the number of beers?

In [None]:
# We start by creating a graph with Styles linked to Beers

secure_graph.run("""                                    
    CALL gds.graph.create(
        'test-graph-group5_9d',
        [
            'Style',
            'Beers'
        ],
        {
            OF_TYPE: {
                orientation: 'Natural'
            }
        }
    )
""").data()

In [15]:
#Then we use pageRank to calculate the most central styles

secure_graph.run(
    """
        CALL gds.pageRank.stream('test-graph-group5_9d') YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).name AS Style, score
        ORDER BY score DESC LIMIT 10
    """
).data()

#We return the top 10 since no specific number was provided

[{'Style': 'American IPA', 'score': 5702.417230224609},
 {'Style': 'American Pale Ale (APA)', 'score': 2825.5490661621093},
 {'Style': 'American Imperial IPA', 'score': 2338.292272949219},
 {'Style': 'Belgian Saison', 'score': 2316.486224365234},
 {'Style': 'American Wild Ale', 'score': 1654.0356842041017},
 {'Style': 'American Imperial Stout', 'score': 1425.5628540039063},
 {'Style': 'American Porter', 'score': 1296.538104248047},
 {'Style': 'American Amber / Red Ale', 'score': 1242.990283203125},
 {'Style': 'American Stout', 'score': 1160.7561294555665},
 {'Style': 'Fruit and Field Beer', 'score': 985.5782577514648}]
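In this projection the `Beers` nodes only point to `Style` nodes and have no incoming relationships. Under the textbook PageRank formulation, assumed here with a damping factor of 0.85 (the GDS implementation may differ in detail), a style's score therefore grows linearly with the number of beers of that style. The sketch below is our own minimal power iteration on a toy graph, not the GDS code, and it assumes each beer has exactly one style:

```python
def pagerank(nodes, edges, damping=0.85, iterations=20):
    """Textbook PageRank: PR(n) = (1 - d) + d * sum(PR(src) / out_degree(src))."""
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1
    scores = {n: 1.0 - damping for n in nodes}
    for _ in range(iterations):
        new_scores = {n: 1.0 - damping for n in nodes}
        for src, dst in edges:
            new_scores[dst] += damping * scores[src] / out_degree[src]
        scores = new_scores
    return scores


# Toy graph: three beers of one style, one beer of another (hypothetical names).
edges = [("b1", "Style A"), ("b2", "Style A"), ("b3", "Style A"), ("b4", "Style B")]
nodes = ["b1", "b2", "b3", "b4", "Style A", "Style B"]

scores = pagerank(nodes, edges)
# Each beer keeps the base score 0.15, so a style with n beers ends at 0.15 + 0.85 * 0.15 * n.
print(round(scores["Style A"], 4), round(scores["Style B"], 4))  # 0.5325 0.2775
```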

### 10) If you had to pick 3 beers to recommend using only this database, which would you pick and why?

We recommend the beers with the highest average overall score that have at least 20 reviews (to ensure trustworthiness).

Additionally, availability should be 'Year-round', so the beer can be drunk at any time of year, and the beer should not be retired, so it can still be bought.

After applying these filters, and since the overall score takes into account score, taste, feel, smell and look, we recommend the 3 beers with the best overall average, as they have the best characteristics according to the reviews.


In [16]:
secure_graph.run("""
                 MATCH (r:Reviews)-[:ABOUT]- (b:Beers{availability:'Year-round',retired:'f'})-[:BREWED_AT]-(br:Breweries)-[:FROM]-(c:Country)
                 where r.overall <> 'Unknown' 
                 with b.name as Beer, avg(tofloat(r.overall)) as Avg_overall, count(r) as Number_of_reviews
                 where Number_of_reviews >=20
                 return Beer, Avg_overall
                 order by Avg_overall DESC
                 limit 3 
                 
""").data()

[{'Beer': 'Zombie Dust', 'Avg_overall': 4.527575619740728},
 {'Beer': 'Juice Jr.', 'Avg_overall': 4.392857142857142},
 {'Beer': 'Tropicália', 'Avg_overall': 4.374169435215941}]
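The effect of the 20-review threshold can be illustrated with a small sketch on made-up ratings (beer names and values are hypothetical, and `top_beers` is our own helper): without the threshold, a beer with a single perfect review would lead the ranking.

```python
from statistics import mean

# Hypothetical overall ratings per beer.
ratings = {
    "Beer A": [5.0],        # one enthusiastic review
    "Beer B": [4.5] * 20,
    "Beer C": [4.0] * 25,
}


def top_beers(ratings, min_reviews, limit=3):
    """Rank beers by mean rating, keeping only those with enough reviews."""
    eligible = {
        beer: mean(values)
        for beer, values in ratings.items()
        if len(values) >= min_reviews
    }
    return sorted(eligible, key=eligible.get, reverse=True)[:limit]


print(top_beers(ratings, min_reviews=1))   # ['Beer A', 'Beer B', 'Beer C']
print(top_beers(ratings, min_reviews=20))  # ['Beer B', 'Beer C']
```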