# Instructions

Your submission will be tested with the code tester. It is important to follow these instructions to ensure your work tests properly.

- Do not change the content of the cells under __SETUP__ and __TESTS__
- Work only in the __YOUR WORK__ area
- Rename the notebook with your group number at the end (substitute XX with your group number).
- Assign the results of each numbered question to the appropriate test variable. For example, the answer of `1.` should be assigned to `test_1`
- Rounding: use the supplied function `hround` to round decimal numbers when instructed. It's important to use this function because there are [multiple ways to round numbers in Python](https://www.knowledgehut.com/blog/programming/python-rounding-numbers) and they may not result in the same value that the tester is testing against.
- Ensure you run the cells under __SETUP__ before you run your work
- Before you submit your work, clean up your notebook. Your notebook has to run without errors in order to be tested. The easiest way to check this is `Kernel->Restart & Run All`
- Answers are provided along with this notebook in eLC (look for a picture named `solution_key`) for your convenience
- You will need to write a program to calculate the answers. Setting the answers to be their correct values without solving them is considered *hardcoding* and will result in zero grade for the assignment as well as a potential academic honesty violation.
- You can also test your submission using [the online code tester](https://notebook-tester.safadi-puzzler.com/)
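
To illustrate why the rounding instruction above matters, here is a minimal sketch comparing Python's built-in `round` with a decimal half-up strategy. The helper `round_half_up` is hypothetical and is not part of this notebook; it only shows that two reasonable rounding methods can disagree on the same number.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(number, ndigits=2):
    """Hypothetical helper (not the notebook's hround): round ties away from zero."""
    quantum = Decimal(10) ** -ndigits          # Decimal('0.01') when ndigits=2
    return float(Decimal(str(number)).quantize(quantum, rounding=ROUND_HALF_UP))

# Built-in round sends exact ties to the nearest even integer.
print(round(0.5), round(1.5), round(2.5))

# 2.675 is stored in binary as slightly less than 2.675, so round goes down.
print(round(2.675, 2))

# Working on the decimal digits of the literal rounds the same number up.
print(round_half_up(2.675, 2))
```

Because the two approaches return different values for the same input, always use the supplied `hround` so that your answers match what the tester expects.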


# SETUP

In [1]:
import pandas as pd
import numpy as np
import networkx as nx
import spacy
from textblob import TextBlob
from spacy import displacy

In [2]:
# DO NOT EDIT OR CHANGE THE CONTENT OF THIS CELL
scenario = 0
nlp = spacy.load('en_core_web_sm')
import nltk;nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/hanisaf/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
def hround(number):
    return round(number, 2 - scenario)

In [4]:
test_1=test_2=test_3=test_4=test_5=test_6=test_7=test_8=test_9=test_10=0.0
test_11=test_12=test_13=test_14=test_15=0.0

In this homework, we have data from the [100 Resilient Cities](https://www.rockefellerfoundation.org/100-resilient-cities/#:~:text=Overview%20In%202013%2C%20The%20Rockefeller%20Foundation%20pioneered%20100,a%20roadmap%20to%20resilience%20along%20four%20main%20pathways%3A?msclkid=5705866aaaaa11ec85d9402890b45f8f). The data contains 30 documents from 30 cities outlining their strategies for urban resilience. We are going to focus on text and network analytics.

In [5]:
data = pd.read_json('resillient_cities.json')
data.head()

Unnamed: 0,city,country,text,ENTITIES
0,Melbourne,Australia,﻿PEOPLE ARE\nAT THE HEART OF ALL CITIES\nA res...,"{'LOCATION': ['Collingwood', 'Sri Lankan', 'So..."
1,Rio de Janeiro,Brazil,﻿100 Resilient Cities - Pioneered by The Rocke...,"{'ORGANIZATION': ['IPCC', 'PARTNERS World Bank..."
2,Medelin,Colombia,﻿A CITY THAT TRANSFORMS ITSELF FOR ITS PEOPLE\...,"{'LOCATION': ['Colombia', 'Proantioquia', 'l.A..."
3,Vejles,Denmark,﻿Letter from the Mayor\nArne Sigtenbjerggaard\...,"{'PERSON': ['Torben Christensen', 'Burgos', 'J..."
4,Quito,Ecuador,﻿MAURICIO RODAS\nMAYOR OF THE METROPOLITAN DIS...,"{'LOCATION': ['Mexico City Cuntlupo', 'Quito',..."


In [6]:
len(data)

30

## Part 1: text analytics

1. Report the unique city names in a sorted list

2. Report the unique country names in a sorted list

3. What are the five most frequent words in `text`? Return the results in a sorted list

Now we will do some text analysis. Because NLP algorithms are computationally expensive, we are going to do them on one text (so that you and the tester save time evaluating the notebook)

4. Let us focus on the `text` of the city of `Glasgow`. Extract the text from the data frame, then perform sentiment analysis with `TextBlob`. Which sentence in this text has the lowest (most negative) polarity?

5. Using `spacy`, perform named entity detection on Glasgow's text. Now focus on organization `ORG` entities. What are the five most frequent organization entities? Return the results in a series sorted by frequency.

Now, we want to use the named entities to learn more about what was discussed in these documents. To save time, I am giving you the extracted named entities in the column `ENTITIES`. The entities are organized as a dictionary where the keys are entity types and the values are the instances of these entities from the text.
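
As an illustration of that dictionary layout, the following sketch counts organization names across a few made-up rows using only the standard library. The rows, city names, and entity values are placeholders and not taken from the homework data; in pandas, the `.str['ORGANIZATION']` accessor performs the same per-row key lookup.

```python
from collections import Counter

# Made-up rows that imitate the shape of the ENTITIES column (not real data).
rows = [
    {"city": "City A", "ENTITIES": {"ORGANIZATION": ["World Bank", "NGOs"],
                                    "PERSON": ["Jane Doe"]}},
    {"city": "City B", "ENTITIES": {"ORGANIZATION": ["World Bank"],
                                    "LOCATION": ["City B"]}},
    {"city": "City C", "ENTITIES": {"PERSON": ["John Roe"]}},  # no ORGANIZATION key
]

org_counts = Counter()
for row in rows:
    # .get with a default guards against documents lacking that entity type
    org_counts.update(row["ENTITIES"].get("ORGANIZATION", []))

top = org_counts.most_common(2)
print(top)  # [('World Bank', 2), ('NGOs', 1)]
```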

6. Let us perform the same analysis as before but on all documents. Extract the `ORGANIZATION` entities of all documents from the column `ENTITIES`. Report the five popular organization entities in a series sorted by frequencies.

## Part 2: network analytics

We will now use network analytics to better understand relationships among these entities and the cities.
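
As a warm-up, here is a plain-Python sketch of the bookkeeping that a graph library such as `networkx` does for an undirected network: collecting nodes, counting edges, and computing degrees. The city and country names are placeholders, and the dictionary-of-sets representation is an assumption made for illustration, not how `networkx` is implemented.

```python
from collections import defaultdict

# Made-up (city, country) pairs; the names are placeholders, not the homework data.
pairs = [("City A", "Country X"), ("City B", "Country X"), ("City C", "Country Y")]

adjacency = defaultdict(set)          # node -> set of neighbours
for city, country in pairs:
    adjacency[city].add(country)      # undirected: store the link in both directions
    adjacency[country].add(city)

n_nodes = len(adjacency)
# Every edge appears in two neighbour sets, so halve the total.
n_edges = sum(len(neighbours) for neighbours in adjacency.values()) // 2
degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
top_node = max(degree, key=degree.get)

print((n_nodes, n_edges))             # (5, 3)
print(top_node, degree[top_node])     # Country X 2
```

A country shared by several cities appears only once as a node, which is why the node count is smaller than twice the number of pairs.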

7. Let us start easy. Using `networkx`, create an undirected graph, then add an edge between each city and its country based on the data in the columns `city` and `country` (no `ENTITIES` involved in this question). Report the number of nodes and the number of edges in this network as a tuple.

8. Using the graph's `degree` view, report the degrees of the nodes in this network.

9. Which node has the highest degree?

10. Let us turn our attention to organization entities. We want to better understand organizations' involvement with cities. Create an undirected network whose edges link each city to every organization entity pertaining to that city. How many nodes and edges are there in the network? Return a tuple.

11. Using the above-created network, return a sorted list of cities associated with `United Nations`

12. We want to create a network similar to the previous one but a directed network this time. Create a `DiGraph` and add directed edges from each city to each organization showing in the `ENTITIES` of that city. We want to rank organizations based on their involvement with multiple cities. To do this look at the `in_degree` property of the created network. Sort based on the frequency of association (descending). Show the first 10 organizations with their indegrees as a list of tuples (again sorted based on that frequency).

13. As you can see, some organizations show as acronyms. Select only organization names with multiple words. Show the top ten in the same format as above.

14. In this last analysis, we want to identify potentially influential people. We define influence as being mentioned in the same text where a number containing the keyword `million` or the keyword `billion` is mentioned. Focus on the `ENTITIES` column and select only rows containing the keywords (million or billion) in a `MONEY` entity type. Then create an undirected network and associate each city to each person name from the `PERSON` list in `ENTITIES`. So this network has an association between cities and people mentioned in the text only when high money numbers are discussed (hence influence). Show the number of nodes and edges in a tuple.

15. Calculate closeness centrality and show the 15 nodes with the highest closeness centrality (in sorted order), with the following exclusions:

- exclude any name with `berkowitz` in any casing since `Michael Berkowitz` was the lead researcher on this project
- exclude any one-word names
- exclude any name matching a city name (from the column city) in the data
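
To make the closeness ranking and the exclusions above concrete, here is a minimal sketch on a made-up city/person network using breadth-first search. It assumes a connected graph, where closeness is (n - 1) divided by the sum of shortest-path lengths; `networkx` applies an extra scaling when a graph has several components, which this sketch does not reproduce. All names are placeholders.

```python
from collections import deque

# Made-up city/person links; all names are placeholders.
edges = [("City A", "Ann Lee"), ("City A", "Bob"), ("City A", "Carla Diaz"),
         ("City B", "Carla Diaz"), ("City B", "Dan Berkowitz")]
cities = {"City A", "City B"}

adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

def closeness(node):
    """(n - 1) / sum of shortest-path lengths; valid because this toy graph is connected."""
    dist = {node: 0}
    queue = deque([node])
    while queue:                                  # breadth-first search
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    return (len(adjacency) - 1) / sum(dist.values())

scores = {node: closeness(node) for node in adjacency}
ranked = sorted(scores, key=scores.get, reverse=True)

# Apply the three exclusions: one-word names, 'berkowitz' in any casing, city names.
kept = [name for name in ranked
        if len(name.split()) > 1
        and "berkowitz" not in name.lower()
        and name not in cities]
print(kept)  # ['Carla Diaz', 'Ann Lee']
```

Here `Carla Diaz` ranks first among the kept names because she is linked to both cities, which shortens her paths to everyone else.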



In [7]:
test_1 = sorted(set(data.city))
test_2 = sorted(set(data.country))

In [8]:
words = pd.Series((data.text + " ").sum().split())
test_3 = sorted(words.value_counts().index[:5])

In [9]:
# most_negative_polarity 
text = data.loc[data.city=='Glasgow', 'text'].iloc[0]
blob = TextBlob(text)
glasgow_sentiment = [(sent.raw, sent.sentiment.polarity, sent.sentiment.subjectivity) for sent in blob.sentences]
test_4 = sorted(glasgow_sentiment, key=lambda t: t[1])[0][0]

In [10]:
doc = nlp(text)
org_entities = [ent.text for ent in doc.ents if ent.label_ == 'ORG']
test_5 = pd.Series(org_entities).value_counts().head()

In [11]:
# concatenate the ORGANIZATION lists of all documents, then count occurrences
organizations = data['ENTITIES'].str['ORGANIZATION'].sum()
test_6 = pd.Series(organizations).value_counts().head()

In [12]:
net1 = nx.Graph()
net1.add_edges_from(data[['city', 'country']].values)
test_7 = len(net1.nodes), len(net1.edges)
test_8 = degrees =  net1.degree()

In [13]:
test_9 = max(degrees, key=lambda e: e[1] )

In [14]:
# network of the cities to organizations
data['organizations'] = data.ENTITIES.str['ORGANIZATION']
edges = data[['city', 'organizations']].explode(column='organizations').values
net2 = nx.Graph()
net2.add_edges_from(edges)
test_10 = len(net2.nodes), len(net2.edges)
# cities linked to the 'United Nations' node
test_11 = sorted(net2['United Nations'])

In [15]:
net3 = nx.DiGraph()
net3.add_edges_from(edges)
indegrees = sorted(net3.in_degree, key = lambda e: e[1], reverse=True)
test_12 = indegrees[:10]
test_13 = [e for e in indegrees if len(e[0].split()) > 1][:10]

In [16]:
#people of influence
data['money'] = data.ENTITIES.str['MONEY']
data['person'] = data.ENTITIES.str['PERSON']

In [17]:
step1 = data[['city', 'money', 'person']].explode('money', ignore_index=True).dropna()
step2 = step1[step1.money.str.contains("million") | step1.money.str.contains("billion")]
step3 = step2.explode('person')
edges = step3[['city', 'person']].values
net4 = nx.Graph()
net4.add_edges_from(edges)
test_14 = len(net4.nodes), len(net4.edges)

In [18]:
closeness = nx.closeness_centrality(net4)

In [19]:
city_names = set(data.city)  # build the lookup set once
sorted_closeness = sorted(closeness.items(), key=lambda e: e[1], reverse=True)
test_15 = [n for n, c in sorted_closeness if (len(n.split()) > 1) and ('berkowitz' not in n.lower()) and (n not in city_names)][:15]
test_15 = sorted(test_15)

# TESTS

In [20]:
### TEST 1
test_1

['Amman',
 'Athens',
 'Bangkok',
 'Berkeley',
 'Boulder',
 'Bristol',
 'Byblos',
 'Dakar',
 'El Paso',
 'Glasgow',
 'Greater Christchurch',
 'Medelin',
 'Melbourne',
 'Mexico City',
 'New Orleans',
 'Norfolk',
 'Oakland',
 'Pittsburgh',
 'Quito',
 'Ramallah',
 'Rio de Janeiro',
 'Rotterdam',
 'San Francisco',
 'Santa Fe',
 'Semarang',
 'Surat',
 'Thessaloniki',
 'Toyama',
 'Vejles',
 'Wellington']

In [21]:
## TEST 2
test_2

['Australia',
 'Brazil',
 'Colombia',
 'Denmark',
 'Ecuador',
 'Greece',
 'India',
 'Indonesia',
 'Japan',
 'Jordan',
 'Lebanon',
 'Mexico',
 'Netherlands',
 'New Zealand',
 'Palestine',
 'Senegal',
 'Thailand',
 'UK',
 'USA']

In [22]:
## TEST 3
test_3

['and', 'in', 'of', 'the', 'to']

In [23]:
## TEST 4
test_4

'There is no getting away from the fact that these are difficult times for Glasgow.'

In [24]:
## TEST 5
test_5

Glasgow                       26
GEAPP                          4
The Rockefeller Foundation     2
IDENTIFY                       2
Scottish Enterprise            2
dtype: int64

In [25]:
## TEST 6
test_6

Rockefeller Foundation        22
Resilience Strategy           20
The Rockefeller Foundation    20
NGOs                          18
CRF                           16
dtype: int64

In [26]:
## TEST 7
test_7

(49, 30)

In [27]:
## TEST 8
test_8

DegreeView({'Melbourne': 1, 'Australia': 1, 'Rio de Janeiro': 1, 'Brazil': 1, 'Medelin': 1, 'Colombia': 1, 'Vejles': 1, 'Denmark': 1, 'Quito': 1, 'Ecuador': 1, 'Athens': 1, 'Greece': 2, 'Thessaloniki': 1, 'Surat': 1, 'India': 1, 'Semarang': 1, 'Indonesia': 1, 'Toyama': 1, 'Japan': 1, 'Amman': 1, 'Jordan': 1, 'Byblos': 1, 'Lebanon': 1, 'Mexico City': 1, 'Mexico': 1, 'Rotterdam': 1, 'Netherlands': 1, 'Greater Christchurch': 1, 'New Zealand': 2, 'Wellington': 1, 'Ramallah': 1, 'Palestine': 1, 'Dakar': 1, 'Senegal': 1, 'Bangkok': 1, 'Thailand': 1, 'Bristol': 1, 'UK': 2, 'Glasgow': 1, 'Berkeley': 1, 'USA': 9, 'Boulder': 1, 'El Paso': 1, 'New Orleans': 1, 'Norfolk': 1, 'Oakland': 1, 'Pittsburgh': 1, 'San Francisco': 1, 'Santa Fe': 1})

In [28]:
## TEST 9
test_9

('USA', 9)

In [29]:
## TEST 10
test_10

(4472, 4948)

In [30]:
## TEST 11
test_11

['Amman', 'Byblos', 'Quito', 'Toyama']

In [31]:
## TEST 12
test_12

[('Rockefeller Foundation', 22),
 ('Resilience Strategy', 20),
 ('The Rockefeller Foundation', 20),
 ('NGOs', 18),
 ('CRF', 16),
 ('PRA', 12),
 ('City', 11),
 ('NGO', 10),
 ('City Council', 10),
 ('CRO', 10)]

In [32]:
## TEST 13
test_13

[('Rockefeller Foundation', 22),
 ('Resilience Strategy', 20),
 ('The Rockefeller Foundation', 20),
 ('City Council', 10),
 ('Platform Partners', 9),
 ('City Resilience Framework', 8),
 ('City Resilience Framework ( CRF', 8),
 ('World Bank', 7),
 ('Steering Committee', 7),
 ('the World Bank', 6)]

In [33]:
## TEST 14
test_14

(410, 417)

In [34]:
## TEST 15
test_15

['Bruna Santos',
 'Chico Mendes',
 'Cristina Mendonga',
 'Eduardo Paes',
 'Fundagao Roberto Marinho',
 'Instituto Pereira Passos',
 'Kirsten Kramer',
 'Lauretta Burke',
 'Luciana Nery',
 'Magdala Arioli',
 'Martha Macedo de Lima Barata',
 'Pensa Saia de Ideias',
 'Rio Resiliente',
 'Vargem Pequena',
 'Zoraide Gomes']