# Analysis and Visualization of Complex Data
---
## Exercise 01 - Create your personal word cloud

In this exercise you will produce your personal word cloud using Python.

First we need to install the wordcloud package (someone already did the most difficult job - a function that produces a word cloud - so let's take advantage of that!). Go to the terminal and run:

```
$ pip install wordcloud
```

Import the required packages:

In [2]:
# Import the pandas library for data manipulation and analysis (e.g., reading CSV files, handling dataframes)
import pandas as pd

# Import NumPy for numerical operations and array handling (used for masks and matrix operations)
import numpy as np

# Import the os module to interact with the operating system
# (e.g., managing file paths, checking directories, listing files)
import os

# Import matplotlib's pyplot module for creating and displaying visualizations
import matplotlib.pyplot as plt

# Import WordCloud class to generate word clouds
# Import STOPWORDS to access the default list of common words to exclude
from wordcloud import WordCloud, STOPWORDS

You may start by cloning the AVCAD GitHub repository to your local machine.

First check your current directory:

In [None]:
# Return the current working directory (the folder where Python is reading/saving files by default)
os.getcwd()

'c:\\Users\\psegurado\\Documents\\Aulas\\Mestrado_Data_Science\\Aulas_AVCAD_2026\\greends-avcad-2026_Backup\\exercises'

You may change to the directory into which you want to clone the GitHub repository (change the path accordingly):

In [None]:
# Change the current working directory to the specified folder path
# After this command, Python will read and save files in this folder by default
os.chdir("C:/Users/psegurado/Documents/Aulas/Mestrado_Data_Science/Aulas_AVCAD_2026/Exercises_local")

# Display the current working directory to confirm the change was successful
os.getcwd()

'C:\\Users\\psegurado\\Documents\\Aulas\\Mestrado_Data_Science\\Aulas_AVCAD_2026\\greends-avcad-2026_Backup\\exercises'
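A hard-coded path only works on the machine it was written for. As a minimal sketch (the helper name `change_directory` is hypothetical and not part of the course material), you could check that the folder exists before switching to it, so that a typo in the path gives a clear error message:

```python
import os
import tempfile
from pathlib import Path


def change_directory(target):
    """Switch to `target` if it exists and return the new working directory."""
    target = Path(target).expanduser()
    if not target.is_dir():
        raise FileNotFoundError(f"Directory not found: {target}")
    os.chdir(target)
    return Path.cwd()


# Demonstration with a temporary folder standing in for your own path
start_dir = Path.cwd()
with tempfile.TemporaryDirectory() as tmp:
    new_dir = change_directory(tmp)
    print(new_dir)
    os.chdir(start_dir)  # go back before the temporary folder is deleted
```

In the notebook you would call `change_directory` with your own folder instead of the temporary one.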

Now you can clone the GitHub repository to your local machine from the terminal:

```
$ git clone https://github.com/isa-ulisboa/greends-avcd-2026.git
```

Now you can import the wine review dataset:

In [None]:
df = pd.read_csv("examples/winemag-data-130k-v2.csv", index_col=0)

Check table (first 5 rows):

In [None]:
df.head()

Table summary:

In [None]:
print("There are {} observations and {} features in this dataset. \n".format(df.shape[0],df.shape[1]))

Check only part of the table ("country", "description" and "points")

In [None]:
df[["country", "description","points"]].head()

Now let us run the wordcloud function with the imported data.

First it might be helpful to get some information about the wordcloud package:

In [None]:
?WordCloud

Combine all wine reviews into one big text and create a big fat cloud to see which characteristics are most common in these wines.

In [None]:
text = " ".join(review for review in df.description)
# len(text) counts characters (spaces included), not words
print("There are {} characters in the combination of all reviews.".format(len(text)))
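Note that `len()` applied to a string returns the number of characters. A rough word count can be obtained by splitting the string on whitespace first. The short sketch below uses two made-up reviews (not taken from the dataset) to show the difference:

```python
# Two invented reviews standing in for df.description
sample_reviews = ["Ripe and fruity", "Tart, snappy and crisp"]
sample_text = " ".join(sample_reviews)

n_characters = len(sample_text)      # every character, spaces included
n_words = len(sample_text.split())   # whitespace-separated tokens

print(n_characters, n_words)
```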

Create stopword list (words that you don't want to include - make a test first and choose words accordingly):

In [None]:
stopwords = set(STOPWORDS)
stopwords.update(["drink", "now", "wine", "flavor", "flavors"])
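One possible way to "make a test first" is to list the most frequent words in your text and then decide which of them carry no useful meaning. The sketch below uses only the standard library; the helper `most_frequent_words` and the sample sentence are illustrative assumptions, and the tokenisation is deliberately simple, so it will not match exactly what the wordcloud package does internally.

```python
import re
from collections import Counter


def most_frequent_words(text, n=10):
    """Return the n most common lower-cased words in text."""
    # [^\W\d_]+ matches runs of letters, including accented ones
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words).most_common(n)


sample_text = "This wine has ripe fruit flavors. Drink this wine now."
for word, count in most_frequent_words(sample_text, n=3):
    print(word, count)
```

Words that appear at the top of this list but say little about the content (here "this" and "wine") are candidates for the stopword list.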

Generate and display a word cloud image

In [None]:
# Pass the stopword list created above so that those words are excluded
wordcloud = WordCloud(stopwords=stopwords, max_font_size=100, max_words=100, background_color="white").generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

NOTE: You've probably noticed the argument interpolation="bilinear" in plt.imshow(). It makes the displayed image appear smoother. For more information about this choice, see the matplotlib documentation on interpolation methods for imshow.

In this exercise you will adapt the code below using the keywords that best describe your interests. For that you may use the text from your CV, a motivation letter or simply a list of keywords. First create and save a txt file (no commas or semicolons) containing your text.

I used a text file with a list of my publications (note: the text was converted to csv and all punctuation marks were converted to spaces).
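If your text still contains punctuation, one way to replace it with spaces is sketched below. This is only an assumption about how the clean-up could be done, not the procedure actually used for the file in this notebook; the helper `strip_punctuation` and the sample string are made up for illustration.

```python
import string


def strip_punctuation(raw_text):
    """Replace every ASCII punctuation mark with a space and squeeze whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(raw_text.translate(table).split())


raw_text = "Fish ecology; river restoration, species distribution models."
clean_text = strip_punctuation(raw_text)
print(clean_text)
```

Note that hyphens are also replaced, so hyphenated terms are split into separate words.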

In [None]:
# Read the whole file; the "with" block closes the file automatically
with open("C:/Users/psegurado/Documents/Aulas/Mestrado_Data_Science/Aulas_AVCAD_2023/greends-avdc-exercises/data/CV_PSegurado_EN_March2021.txt") as f:
    text = f.read()
print(text)

Create stopword list (words that you don't want to include - make a test first and choose words accordingly):

In [None]:
stopwords = set(STOPWORDS)
stopwords.update(["Segurado", "Ferreira", "Aguiar", "Branco", "Schinegger", "Borja", "Reino", "Haidvogl", "Hermoso", "Filipe", "doi", "Araújo", "Teixeira", "Godinho", "Carreira", "dx", "based", "Total environment", "Oliveira", "Santos", "Almeida", "Catry", "Duarte", "Beja", "Rebelo", "Neves", "Pereira", "Pont", "Costa", "Feld", "Journal", "Using", "Research", "scitotenv", "J", "P", "G", "M", "V", "J", "F", "T", "N", "C", "B", "S", "W", "de", "https", "D", "L", "E", "Science", "org", "Total", "Environment"])

Generate and display a word cloud image:

In [None]:
# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords).generate(text)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Set max_font_size, fix the image size and lighten the background (the maximum number of words can also be changed with max_words):

In [None]:
wordcloud = WordCloud(stopwords=stopwords, max_font_size=70, background_color="white", width=600, height=400).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Save the image to a file in the current working directory:

In [None]:
wordcloud.to_file("wordcloud_PS.png")
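If you prefer to keep images in a subfolder such as img, the folder has to exist before the file is written. A minimal sketch, assuming a hypothetical helper `output_path` (not part of the wordcloud package):

```python
import os
import tempfile


def output_path(folder, filename):
    """Create folder if needed and return the full path for filename."""
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, filename)


# Demonstration in a temporary directory; in the notebook you would use "img"
with tempfile.TemporaryDirectory() as tmp:
    path = output_path(os.path.join(tmp, "img"), "wordcloud_PS.png")
    print(os.path.isdir(os.path.dirname(path)))
    # wordcloud.to_file(path) would then write the image into that folder
```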

## Plotting word clouds using shapes

You may also choose to plot your keywords within a given shape (e.g. one that is related to your interests).

First import the "Image" module of the Python Imaging Library (PIL):

In [None]:
from PIL import Image

Now you will need an image (e.g. in png format) in which the background has the value 255 (white). Copy it into your current directory. You may check the pixel values this way:

In [None]:
turtle_mask = np.array(Image.open("C:/Users/psegurado/Documents/Aulas/Mestrado_Data_Science/Aulas_AVCAD_2023/greends-avdc-exercises/turtle.png"))
turtle_mask

Check mask

In [None]:
turtle_mask = np.array(Image.open("C:/Users/psegurado/Documents/Aulas/Mestrado_Data_Science/Aulas_AVCAD_2023/greends-avdc-exercises/turtle.png"))
#turtle_mask[230:250, 240:250]
plt.imshow(turtle_mask)
plt.axis("off")
plt.show()
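The wordcloud package treats pure white pixels (value 255) as the area where no words are drawn. If the background of your image has another value, it has to be converted first. The sketch below assumes a single-channel mask whose background is 0; the function name `to_white_background` and the toy array are made up for illustration.

```python
import numpy as np


def to_white_background(mask, background_value=0):
    """Return a copy of mask in which background_value pixels are set to 255."""
    mask = np.asarray(mask)
    return np.where(mask == background_value, 255, mask).astype(np.uint8)


# Tiny made-up mask: 0 = background, 1 = shape
toy_mask = np.array([[0, 1, 0],
                     [1, 1, 1],
                     [0, 1, 0]], dtype=np.uint8)
print(to_white_background(toy_mask))
```

For a real image you would pass the array returned by `np.array(Image.open(...))` instead of `toy_mask`, after checking which value its background actually has.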

And finally you may create, save and display the word cloud:

In [None]:
wc = WordCloud(background_color="white",
               max_words=1000,
               mask=turtle_mask,
               stopwords=stopwords,
               contour_width=2,
               repeat=True,
               max_font_size=50,
               contour_color='grey')

# Generate a wordcloud
wc.generate(text)

# store to file
wc.to_file("turtle_wordcloud.png")

# show
plt.figure(figsize=[10,5])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

### Questions
- How many variables is the word cloud you just produced based on?

- What is the type of each variable in your word cloud?

# References
[Generating WordClouds in Python Tutorial](https://www.datacamp.com/tutorial/wordcloud-python)

[Python Wordcloud Tutorial](https://python-course.eu/applications-python/python-wordcloud-tutorial.php)

[How to create a word cloud in Python?](https://www.projectpro.io/recipes/create-word-cloud-python)