# Analysis and Visualization of Complex Data
---
## Exercise 01 - Create your personal word cloud

In this exercise you will produce your personal word cloud using Python.

First we need to install the wordcloud package (someone has already done the most difficult job, writing a function that produces a word cloud, so let's take advantage of that!). Go to the terminal and run:

```
$ pip install wordcloud
```

Then import the required packages:

In [None]:
# Import the pandas library for data manipulation and analysis (e.g., reading CSV files, handling dataframes)
import pandas as pd

# Import NumPy for numerical operations and array handling (used for masks and matrix operations)
import numpy as np

# Import the os module to interact with the operating system
# (e.g., managing file paths, checking directories, listing files)
import os

# Import matplotlib's pyplot module for creating and displaying visualizations
import matplotlib.pyplot as plt

# Import WordCloud class to generate word clouds
# Import STOPWORDS to access the default list of common words to exclude
from wordcloud import WordCloud, STOPWORDS

In this exercise you can adapt the code I used to produce my own word cloud.

You may start by cloning the AVCAD repository (where the example files and this Jupyter notebook are available) to your local machine.

First check your current directory:

In [None]:
# Return the current working directory (the folder where Python is reading/saving files by default)
os.getcwd()

Now change to the directory into which you want to clone the GitHub repository (change the path accordingly):

In [None]:
# Change the current working directory to the specified folder path
# After this command, Python will read and save files in this folder by default
os.chdir("C:/Users/psegurado/Documents/Aulas/Mestrado_Data_Science/Aulas_AVCAD_2026/Exercises_local")

# Display the current working directory to confirm the change was successful
os.getcwd()
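
As an alternative to changing the working directory, file locations can be built from a single base folder with `pathlib`. The sketch below is only an illustration: `base_dir`, the `data` subfolder and the file name `my_text.txt` are hypothetical placeholders that you would replace with your own locations.

```python
from pathlib import Path

# Hypothetical base folder: replace with the folder that holds your files
base_dir = Path.home() / "Exercises_local"

# Build paths relative to the base folder instead of calling os.chdir()
data_file = base_dir / "data" / "my_text.txt"

print(data_file)           # Full path to the (hypothetical) text file
print(data_file.exists())  # True only if the file is actually there
```

With this approach every path is stated explicitly, so the code does not depend on the folder from which the notebook was started.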

Now you may clone the GitHub repository to your local machine by running this in the command terminal (first you need to change to the directory where you want to store your clone):

```
$ git clone https://github.com/isa-ulisboa/greends-avcad-2026.git
```

Now adapt the code below using the keywords that best describe your interests. For that you may use the text from your CV, a motivation letter or simply a list of keywords. First create and save a txt file (no commas or semicolons) containing your text.

I used a text file with a list of my publications (Note: the text was converted to csv and all punctuation marks were converted to "space").

In [None]:
# Open the text file located at the specified path
# The "with" statement closes the file automatically once its content has been read
# .read() reads the entire content of the file into a single string
with open(
    "C:/Users/psegurado/Documents/Aulas/Mestrado_Data_Science/Aulas_AVCAD_2026/Exercises_local/data/CV_PSegurado_EN_March2021.txt"
) as file:
    text = file.read()

# Print the full text content to the console
# This allows you to verify that the file was loaded correctly
print(text)
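
The note above mentions that punctuation marks were replaced by spaces. One possible way to do this in Python, rather than by hand, is sketched below. `replace_punctuation` is a hypothetical helper name, the sample sentence is made up, and only the ASCII punctuation listed in `string.punctuation` is handled.

```python
import string

def replace_punctuation(raw_text):
    """Return raw_text with every ASCII punctuation mark replaced by a space."""
    # Map each punctuation character to a single space
    table = str.maketrans({mark: " " for mark in string.punctuation})
    return raw_text.translate(table)

# Made-up sample sentence, only for illustration
sample = "Rivers, lakes; wetlands (and ponds)."
print(replace_punctuation(sample))
```

In the notebook, the same function could be applied to the loaded string, e.g. `text = replace_punctuation(text)`.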

Create a stopword list, i.e., a list of words that you don't want to include in the word cloud.

Tip: generate a test word cloud first and choose the words to exclude accordingly:

In [None]:
# Create a set of default stopwords provided by the wordcloud library
stopwords = set(STOPWORDS)

# Add additional custom words to the stopwords set
# These include author names, common publication terms, URLs, journal names,
# single letters, and generic scientific words that you don't want
# appearing in the WordCloud
stopwords.update([
    "Segurado", "Ferreira", "Aguiar", "Branco", "Schinegger", "Borja", "Reino",
    "Haidvogl", "Hermoso", "Filipe", "doi", "Araújo", "Teixeira", "Godinho",
    "Carreira", "dx", "based", "Total environment", "Oliveira", "Santos",
    "Almeida", "Catry", "Duarte", "Beja", "Rebelo", "Neves", "Pereira", "Pont",
    "Costa", "Feld", "Journal", "Using", "Research", "scitotenv",
    "J", "P", "G", "M", "V", "F", "T", "N", "C", "B", "S", "W", "de", "https",
    "D", "L", "E", "Science", "org", "Total", "Environment",
])
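
To follow the tip above, it can help to look at the most frequent words before deciding what to exclude. The sketch below counts words with `collections.Counter`. The helper `top_words`, the sample text and the small stopword set are made up for illustration, and the simple regular expression is an assumption, not the tokenizer that `wordcloud` uses internally.

```python
import re
from collections import Counter

def top_words(raw_text, excluded, n=5):
    """Return the n most frequent words in raw_text, ignoring excluded words."""
    # Lowercase so that "River" and "river" are counted together;
    # the pattern keeps runs of letters (including accented ones) only
    words = re.findall(r"[^\W\d_]+", raw_text.lower())
    excluded_lower = {word.lower() for word in excluded}
    counts = Counter(word for word in words if word not in excluded_lower)
    return counts.most_common(n)

# Made-up sample data, only for illustration
sample_text = "River fish and river birds: the fish of the river"
sample_stopwords = {"and", "the", "of"}
print(top_words(sample_text, sample_stopwords, n=2))
```

In the notebook you could call `top_words(text, stopwords, n=30)` and add to `stopwords` any frequent word that you do not want to see in the image.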

#### Generate and display a word cloud image

In [None]:
# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords).generate(text)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear') # 'interpolation="bilinear"' smooths the image to make it look less pixelated
plt.axis("off") # Remove the x and y axes (not useful for image visualization)
plt.show() # Render the image on the screen


NOTE: You've probably noticed the argument interpolation="bilinear" in plt.imshow(). It makes the displayed image appear smoother. For more information about the available options, see the matplotlib documentation on interpolation methods for imshow.

Set max_font_size, lighten the background and set the image dimensions (the maximum number of words can also be changed with max_words):

In [None]:
# Create a WordCloud object with customized settings and generate it from the text
wordcloud = WordCloud(
    stopwords=stopwords,       # Remove unwanted words
    max_font_size=70,          # Limit the size of the largest word
    background_color="white",  # Set the background color
    width=600,                 # Image width (in pixels)
    height=400,                # Image height (in pixels)
).generate(text)               # Build the word cloud from the provided text string

# Create a new figure to display the WordCloud
plt.figure()

# Display the WordCloud image
# interpolation='bilinear' smooths the image for better visual quality
plt.imshow(wordcloud, interpolation='bilinear')

# Remove axis ticks and frame (not useful for image display)
plt.axis("off")

# Render the final image output
plt.show()

Save the image to a file (here, in the current working directory):

In [None]:
wordcloud.to_file("wordcloud_PS.png")
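
If you prefer to keep the image in a separate folder such as `img`, that folder has to exist before saving. A minimal sketch, assuming a hypothetical helper named `prepare_output_path`:

```python
from pathlib import Path

def prepare_output_path(folder, filename):
    """Create folder if needed and return the full path of filename inside it."""
    folder_path = Path(folder)
    # parents=True also creates missing parent folders;
    # exist_ok=True avoids an error when the folder already exists
    folder_path.mkdir(parents=True, exist_ok=True)
    return str(folder_path / filename)

# In the notebook this could be used as:
# wordcloud.to_file(prepare_output_path("img", "wordcloud_PS.png"))
```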

## Plotting word clouds using shapes

You may also choose to plot your keywords within a given shape (e.g. one that is related to your interests).

First import the `Image` module of the Python Imaging Library (PIL):

In [None]:
from PIL import Image

Now you will need an image (e.g. in PNG format) in which the background has the value 255 (white). Copy it to your current directory. You may load and inspect it this way:

In [None]:
# Open the image file (turtle.png) using PIL (Python Imaging Library)
# Then convert the image into a NumPy array so it can be used as a mask in WordCloud
turtle_mask = np.array(
    Image.open("C:/Users/psegurado/Documents/Aulas/Mestrado_Data_Science/Aulas_AVCAD_2023/greends-avdc-exercises/turtle.png")
)

# Display the NumPy array representation of the image (matrix of pixel values)
# This lets you verify that the mask was loaded correctly
turtle_mask

Check the mask:

In [None]:
# Open the turtle image file using PIL and convert it into a NumPy array
# This array contains the pixel values and will later be used as a WordCloud mask
turtle_mask = np.array(
    Image.open("C:/Users/psegurado/Documents/Aulas/Mestrado_Data_Science/Aulas_AVCAD_2023/greends-avdc-exercises/turtle.png")
)

# (Optional – currently commented out)
# This would display a small slice of the image array (rows 230–250, columns 240–250)
# Useful for inspecting pixel values in a specific region of the mask
# turtle_mask[230:250, 240:250]

# Display the mask image to visually confirm it loaded correctly
plt.imshow(turtle_mask)

# Remove axis ticks and frame (not needed for image visualization)
plt.axis("off")

plt.show()
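
If the background of your image is stored as 0 rather than 255, the mask can be adjusted before it is used. The sketch below works on a tiny made-up array instead of a real image, and `to_white_background` is a hypothetical helper. It assumes a single-channel (greyscale) mask in which the value 0 occurs only in the background.

```python
import numpy as np

def to_white_background(mask):
    """Return a copy of mask in which every 0 is replaced by 255."""
    return np.where(mask == 0, 255, mask).astype(np.uint8)

# Made-up 3x3 mask: 0 marks the background, 1 marks the shape
toy_mask = np.array([[0, 0, 0],
                     [0, 1, 0],
                     [0, 0, 0]], dtype=np.uint8)

print(to_white_background(toy_mask))
```

The original array is left unchanged, so you can compare both versions with `plt.imshow()` before choosing which one to pass as `mask`.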

Finally, you may create, export and display the word cloud:

In [None]:
# Create a WordCloud object with customized settings
wc = WordCloud(
    background_color="white",   # Set the background color of the image
    max_words=1000,             # Maximum number of words to display
    mask=turtle_mask,           # Use the turtle image as a shape mask
    stopwords=stopwords,        # Remove predefined and custom stopwords
    contour_width=2,            # Thickness of the outline around the mask shape
    repeat=True,                # Allow words to repeat to better fill the mask shape
    max_font_size=50,           # Maximum size for the largest word
    contour_color='grey'        # Color of the contour (outline) around the mask
)

# Generate the word cloud from the text data
wc.generate(text)

# Save the generated WordCloud image to a file in the working directory
wc.to_file("turtle_wordcloud.png")

# Create a matplotlib figure with custom size (width=10, height=5 inches)
plt.figure(figsize=[10,5])

# Display the generated WordCloud image
# interpolation='bilinear' smooths the visual appearance
plt.imshow(wc, interpolation='bilinear')

# Remove axes (not needed for image display)
plt.axis("off")

plt.show()

### Questions
- How many variables is the word cloud you just produced based on?

- What is the type of each variable in your word cloud?

# References
[Generating WordClouds in Python Tutorial](https://www.datacamp.com/tutorial/wordcloud-python)

[Python Wordcloud Tutorial](https://python-course.eu/applications-python/python-wordcloud-tutorial.php)

[How to create a word cloud in Python?](https://www.projectpro.io/recipes/create-word-cloud-python)