Week 6 of the Leep Talent Data Technician Skills Bootcamp (Level 3)
This section showcases the Python skills I developed during the sixth week of my bootcamp, covering core Python syntax, data structures, control flow, pandas data analysis, data cleaning, outlier detection, and data visualisation — all in Google Colab using real-world datasets.
| Task | Topic | Skills |
|---|---|---|
| Setup & Environment | Google Colab, libraries, mounting Drive | pip, import, pd.read_csv, Drive mount |
| Core Syntax — Shortcuts & Comparisons | Operators and boolean logic | +=, -=, *=, //=, %=, ==, !=, <, > |
| Data Structures | Lists, tuples, sets, dictionaries | indexing, slicing, .append(), .get(), .keys() |
| Control Flow — if / elif / else | Conditional logic | Customer tiers, time-of-day greetings, score graders |
| Day 2 Task 1 — FizzBuzz & Loops | Loops and modulo logic | for, while, range(), % operator |
| Day 3 Task 1 — Student Analysis (pandas) | pandas end-to-end analysis | groupby, pivot_table, apply, sort_values |
| Day 4 Task 1 — GDP Data Cleaning | Real-world data preparation | dropna, str.strip, pd.to_datetime, type conversion |
| Day 4 Task 2 — Visualisation | Charts and correlation analysis | Matplotlib, Seaborn, scatter, histogram, heatmap, IQR outliers |
Tool: Google Colab (Python 3)
Libraries: pandas, NumPy, Matplotlib, Seaborn
All Python work this week was completed in Google Colab — a cloud-based Jupyter notebook environment that runs in the browser without any local installation. Each new session requires re-running the setup cell and remounting Google Drive.
In my own words: Google Colab removes the friction of environment setup so you can focus on the analysis. The key habit to develop is understanding what needs re-running at the start of each session versus what only needs to run once — getting that wrong means Python can't find your files or libraries mid-notebook.
```python
# Install libraries (run once per session)
!pip install pandas matplotlib seaborn --quiet

# Core imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Display options
pd.set_option("display.max_rows", 10)
sns.set(style="whitegrid")
print("Setup complete")
```

```python
# Mount Google Drive to access files stored there
from google.colab import drive
drive.mount('/content/drive')
```

```python
# Upload directly from your device
from google.colab import files
uploaded = files.upload()

# Load the uploaded file into a DataFrame
df = pd.read_csv('/content/student.csv')
df.head()
```

📸 Open Google Colab. Run the setup cell above. Capture the output showing `Setup complete` at the bottom of the cell and the library installation progress. The cell run indicator (green tick/circle) should be visible on the left. Save as `images/setup_colab_complete.png`.
Understanding the Colab environment — what persists, what resets, and how to efficiently re-initialise a session — is practical knowledge that applies equally to any remote or cloud-based Python environment, including Azure Notebooks, AWS SageMaker, and JupyterHub deployments used by real organisations.
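To make that re-initialisation habit concrete, here is a minimal sketch of my own (assuming Colab's default `/content/drive` mount point) that only remounts Drive when it isn't already available:

```python
import os

MOUNT_POINT = "/content/drive"  # Colab's default mount point (assumption)

# MyDrive only appears under the mount point once Drive is mounted
if not os.path.isdir(os.path.join(MOUNT_POINT, "MyDrive")):
    from google.colab import drive
    drive.mount(MOUNT_POINT)
else:
    print("Drive already mounted, skipping remount.")
```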
Tool: Google Colab
Reference: shortcuts_and_comparisions.docx
Practised Python's shortcut assignment operators and comparison operators — the building blocks of any script or data pipeline. Each operator was run and verified against expected output.
In my own words: Shortcut operators like `+=` and `%=` make code shorter and more readable. They appear constantly in loops, accumulators, and data processing scripts, so being confident with them early means the logic of more complex code is easier to follow.
```python
# Addition shortcut
points = 10
points += 5
print(points)  # 15

# Subtraction shortcut
score = 20
score -= 5
print(score)  # 15

# Multiplication shortcut
price = 50
price *= 3
print(price)  # 150

# Division shortcut (returns float)
total = 100
total /= 4
print(total)  # 25.0

# Integer division (floors the result)
boxes = 53
boxes //= 10
print(boxes)  # 5

# Modulo (remainder)
rem = 27
rem %= 10
print(rem)  # 7
```

```python
# Equality and inequality
number = 60
print(number == 60)  # True
print(number != 60)  # False

# Greater / less than
mark = 85
print(mark > 90)  # False
print(mark < 90)  # True

# Compound comparisons
age = 18
print(age >= 18)  # True
print(age <= 18)  # True
```

```python
x = 10
x += 3
print(x)  # 13

price = 50
price *= 2
price /= 5
print(price)  # 20.0

a = 7
b = 10
print(a < b)  # True

age = 18
print(age >= 18)  # True
print(age != 21)  # True
```

📸 In Google Colab, run the shortcut operators code block. Capture the output showing the results (15, 15, 150, 25.0, 5, 7) clearly visible below the cell. The cell input (code) and output should both be visible in the same capture. Save as `images/syntax_shortcuts_output.png`.
Tool: Google Colab
Reference: List__tuple_and_Dictionary_Tasks.docx
Worked through Python's core data structure types — lists, tuples, sets, and dictionaries — practising creation, indexing, slicing, modification, and retrieval for each.
In my own words: Data structures are how Python organises information in memory. Lists are the most flexible because they're ordered and mutable. Tuples are for data that shouldn't change. Sets automatically remove duplicates. Dictionaries are the most powerful for labelled data — essentially a lookup table, which is how a lot of real-world data processing works.
```python
# Create and type-check
mylist = ["apple", "banana", "cherry"]
print(type(mylist))  # <class 'list'>

# Mixed-type list (real-world customer record)
customer_info = ["Mr.X", 22, "UK", "12/05/2001", 1.65]
print(customer_info)

# Negative indexing — get elements [2, 3, 4, 5]
mylist2 = [1, 2, 3, 4, 5, 6, 7]
print(mylist2[-6:-2])  # [2, 3, 4, 5]

# Slicing
nums = [10, 20, 30, 40, 50, 60, 70, 80, 90]
print(nums[2:7])  # [30, 40, 50, 60, 70]

# Extend, append, insert
thislist = ["apple", "banana", "cherry"]
newlist = ["mango", "orange", "plum", "pineapple", "peach"]
thislist.extend(newlist)
thislist.append("orange")
thislist.insert(2, "watermelon")

# Delete a slice
del thislist[2:5]
print(thislist)
```

```python
# Create tuple and access by index
mytuple = ("apple", "banana", "cherry")
print(mytuple[1])   # banana
print(mytuple[-1])  # cherry (negative index)

# Concatenation and repetition
my_tuple = (1, 10, 100)
t1 = my_tuple + (1000, 10000)
t2 = my_tuple * 3
print(t2)       # (1, 10, 100, 1, 10, 100, 1, 10, 100)
print(len(t2))  # 9

# Membership test
print(10 in my_tuple)       # True
print(-10 not in my_tuple)  # True
```

```python
# Sets are unordered and remove duplicates
mySet2 = {"apple", "banana", "cherry", "apple", "banana"}
print(mySet2)  # {'cherry', 'banana', 'apple'}
```

```python
# Create a dictionary
mydict = {"brand": "Ford", "model": "Mustang", "year": 1964}
print(mydict.get("brand"))  # Ford

# Keys and values
dictionary = {"cat": "chat", "dog": "chicken", "horse": "cheval"}
print(dictionary.keys())    # dict_keys(['cat', 'dog', 'horse'])
print(dictionary.values())  # dict_values(['chat', 'chicken', 'cheval'])
```

Screenshot 1 — List operations (slicing and extend)
📸 In Google Colab, run the list slicing and `.extend()` code. Capture the cell with both the code and the output showing the list before and after modification. Save as `images/ds_list_operations.png`.

Screenshot 2 — Dictionary keys and values
📸 Run the dictionary code block. Capture the cell showing the `.keys()` and `.values()` outputs clearly. Save as `images/ds_dict_keys_values.png`.
A pandas DataFrame maps directly onto Python dictionaries: it can be constructed from a dict of column lists, and column access works like a dictionary lookup. Understanding how dictionaries, lists, and indexing work in plain Python therefore makes debugging and extending pandas operations far more intuitive — and is a prerequisite for writing efficient data pipeline code.
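To make that relationship concrete, a minimal sketch using hypothetical sample data, showing a dict of lists becoming a DataFrame and back:

```python
import pandas as pd

# Each dictionary key becomes a column name;
# each list becomes that column's values
records = {
    "name": ["Amara", "Ben", "Chloe"],  # hypothetical sample rows
    "mark": [78, 64, 91],
}
df_demo = pd.DataFrame(records)
print(df_demo)
print(df_demo.to_dict(orient="list"))  # and back to a dict again
```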
Tool: Google Colab
Reference: if_else__elif.xlsx
Practised writing conditional logic using if, elif, and else — including nested conditions and compound conditions with and. Applied to real business scenarios such as customer tiering, time-based greetings, and score grading.
In my own words: `if`/`elif`/`else` is how Python makes decisions. In data analysis, it's used constantly — for categorising records, flagging anomalies, assigning grades, and routing logic. Understanding the order of conditions matters: Python evaluates from top to bottom and stops at the first match, so the most specific condition always comes first.
```python
annualSales = 300000
region = "North"

if annualSales >= 500000:
    print("Gold Customer")
elif annualSales >= 300000:
    print("Silver Customer")
    # Nested if — adds region-specific logic
    if region == "North":
        print("Priority support for North region")
elif annualSales >= 100000:
    print("Bronze Customer")

print("Thank you for your business")

# Output:
# Silver Customer
# Priority support for North region
# Thank you for your business
```

```python
number1 = int(input("Enter the first number: "))
number2 = int(input("Enter the second number: "))

if number1 > number2:
    print("The larger number is:", number1)
elif number2 > number1:
    print("The larger number is:", number2)
else:
    print(number1, "is equal to", number2)
```

```python
num = float(input("Enter a number: "))

if num > 0:
    print("The number is positive.")
elif num < 0:
    print("The number is negative.")
else:
    print("The number is zero.")
```

```python
# Score grader
user_number = int(input("Provide a number: "))
if user_number >= 100:
    print("Excellent")
elif user_number >= 75:
    print("Good")
elif user_number >= 50:
    print("Average")
else:
    print("Below Average")

# Time-based greeting
hour = 14
if hour < 12:
    print("Good Morning")
elif hour < 18:
    print("Good Afternoon")  # Output: Good Afternoon
elif hour < 21:
    print("Good Evening")
else:
    print("Good Night")

# Membership discount
status = input("Enter your membership level (Gold, Silver, Bronze): ")
if status == "Gold":
    print("Discount: 20%")
elif status == "Silver":
    print("Discount: 10%")
elif status == "Bronze":
    print("Discount: 5%")
else:
    print("No discount available.")
```

📸 In Google Colab, run the customer tier system code with `annualSales = 300000` and `region = "North"`. Capture the cell showing the three lines of output: "Silver Customer", "Priority support for North region", and "Thank you for your business". Save as `images/cf_customer_tier.png`.
Conditional logic is the backbone of data categorisation — assigning grades, flagging outliers, routing records to different processing paths. This task demonstrates I can write multi-branch logic correctly, including nested conditions, which is directly relevant to the assign_grade() function used in the pandas exercises later in the week.
Tool: Google Colab
Reference: while_for_loops_table_and_Fbuzz.xlsx, workbook Day 2
FizzBuzz is a classic programming interview challenge. The task also required writing while and for loops for even numbers, multiples of 7, and odd number sequences — demonstrating understanding of iteration, range(), step arguments, and the modulo operator.
In my own words: FizzBuzz tests whether you understand conditional ordering — the divisible-by-15 check must come before the individual 3 and 5 checks because Python stops at the first True condition. Getting that order wrong produces incorrect output silently, which is a good lesson in why the sequence of conditions matters in any real-world classification logic.
Organisation type: Any tech team conducting coding interviews
FizzBuzz is used in software engineering interviews precisely because it requires combining loops, conditionals, and the modulo operator correctly — testing logical thinking under pressure rather than just syntax knowledge.
```python
# range(1, 101) — starts at 1, ends exactly at 100
# FizzBuzz check MUST come first to avoid being skipped by the
# individual Fizz or Buzz conditions
for n in range(1, 101):
    if n % 3 == 0 and n % 5 == 0:
        print("FizzBuzz")
    elif n % 3 == 0:
        print("Fizz")
    elif n % 5 == 0:
        print("Buzz")
    else:
        print(n)
```

```python
# Even numbers 0–100 using while
my_number = 0
while my_number <= 100:
    print(my_number)
    my_number += 2

# Multiples of 7 using while + if
my_number = 0
while my_number <= 100:
    if my_number % 7 == 0:
        print(my_number)
    my_number += 1
```

```python
# Numbers 1 to 50
for n in range(1, 51):
    print(n, end=" ")

# Even numbers 0–100 (step argument)
for n in range(0, 101, 2):
    print(n, end=" ")

# Multiples of 7 (step = 7)
for n in range(0, 101, 7):
    print(n, end=" ")

# Odd numbers 1–100
for n in range(1, 101, 2):
    print(n, end=" ")
```

Screenshot 1 — FizzBuzz output (first 20 numbers)
📸 In Google Colab, run the FizzBuzz code. Capture the output scrolled to show the first 20 numbers — you should clearly see 1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz, 11, Fizz, 13, 14, FizzBuzz, 16... demonstrating all three conditions working correctly. Save as `images/loop_fizzbuzz.png`.

Screenshot 2 — Even numbers 0–100 (while loop)
📸 Run the even-numbers while loop. Capture the output showing even numbers from 0 to 100. The final value visible should be 100. Save as `images/loop_while_even.png`.

Screenshot 3 — For loop with step argument (multiples of 7)
📸 Run the multiples-of-7 for loop using `range(0, 101, 7)`. Capture the output showing 0, 7, 14, 21... 98. Save as `images/loop_for_range_step.png`.
- Finding 1: Ordering the `FizzBuzz` check first is essential — if `n % 3 == 0` came first, every multiple of 15 would print "Fizz" instead of "FizzBuzz". This mirrors real-world classification logic where overlapping categories must be resolved by specificity (see the sketch after this list).
- Finding 2: The `step` argument in `range()` is more efficient than using a `while` loop with manual increment — it communicates intent directly in the function call and reduces the chance of accidental infinite loops.
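To illustrate Finding 1, a deliberately broken sketch — `classify_wrong` is a hypothetical helper of my own, not part of the original task:

```python
# Deliberately wrong ordering: the specific FizzBuzz condition is
# placed last, so it can never be reached.
def classify_wrong(n):
    if n % 3 == 0:
        return "Fizz"              # 15 stops here; this is the bug
    elif n % 5 == 0:
        return "Buzz"
    elif n % 3 == 0 and n % 5 == 0:
        return "FizzBuzz"          # unreachable branch
    else:
        return str(n)

print(classify_wrong(15))  # prints "Fizz", not "FizzBuzz"
```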
FizzBuzz demonstrates that I can combine loops, conditionals, and the modulo operator correctly — a combination that appears in data processing tasks such as batching records, flagging periodic events, and applying rules in sequence.
Dataset: student.csv
File: student.csv
Tool: Google Colab, pandas
A classroom dataset containing student names, class groups, gender, and marks. Used for a structured group exercise covering the full pandas workflow from loading and exploring through to aggregation, advanced operations, and export.
In my own words: This exercise walked through every pandas operation a junior analyst would need to know on day one. The grade assignment function using `apply()` was the most interesting part — it shows how Python functions and pandas work together to create entirely new analytical columns from existing data.
Organisation type: School, training provider, or any organisation tracking performance metrics
A teacher or education data analyst would use these exact techniques to produce class performance reports, identify at-risk students, and compare performance across demographic groups — all operations that would take much longer in a spreadsheet.
```python
import pandas as pd

# Load CSV
df = pd.read_csv('student.csv')

# First 5 rows
df.head()

# Column info and data types
df.info()

# Summary statistics
df.describe()
```

```python
# Single column
df['name']

# Multiple columns
df[['name', 'mark']]

# First 3 rows
df.head(3)

# Filter rows where class is 'Four'
df[df['class'] == 'Four']
```

```python
# Add a 'passed' column (True/False)
df['passed'] = df['mark'] >= 60

# Rename column
df = df.rename(columns={'mark': 'score'})

# Drop a column
df = df.drop(columns=['passed'])
```

```python
# Mean score per class
df.groupby('class')['score'].mean()

# Count students per class
df.groupby('class')['name'].count()

# Average score by gender
df.groupby('gender')['score'].mean()
```

```python
# Pivot table: class vs gender, values = score
df.pivot_table(index='class', columns='gender', values='score')

# Assign grades using a custom function
def assign_grade(score):
    if score >= 85:
        return 'A'
    elif score >= 70:
        return 'B'
    elif score >= 60:
        return 'C'
    else:
        return 'D'

df['grade'] = df['score'].apply(assign_grade)

# Sort by score descending
df.sort_values(by='score', ascending=False)
```

```python
# Save with new 'grade' column to CSV
df.to_csv('student_grades.csv', index=False)
```

Screenshot 1 — df.head() and df.info()
📸 In Google Colab, run the load and explore cells. Capture the output showing `df.head()` (first 5 rows with all columns visible) and below it `df.info()` showing the column names, data types, and non-null counts. Save as `images/pandas_student_load.png`.

Screenshot 2 — groupby mean score per class
📸 Run `df.groupby('class')['score'].mean()`. Capture the output showing each class name on the left and its average score on the right. Save as `images/pandas_student_groupby.png`.

Screenshot 3 — Pivot table (class × gender)
📸 Run the pivot table code. Capture the output table showing class names in rows, gender columns (Female/Male or similar), and average score values in each cell. Save as `images/pandas_student_pivot.png`.

Screenshot 4 — Grade column applied + sorted descending
📸 Run `df['grade'] = df['score'].apply(assign_grade)` followed by `df.sort_values(by='score', ascending=False)`. Capture the resulting DataFrame with the `grade` column visible, sorted so the highest scores and 'A' grades appear at the top. Save as `images/pandas_student_grades.png`.
- Finding 1: `groupby()` reduced a complex manual calculation to a single line — grouping by class and averaging scores would take minutes in a spreadsheet for a large dataset but runs instantly in pandas regardless of size.
- Finding 2: Using `apply()` with a custom function to assign grades is far more readable and maintainable than nested `np.where()` calls — the function logic is clearly visible and easy to modify if the grading boundaries change (a sketch of the alternative follows this list).
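For comparison, a sketch of the nested `np.where()` alternative mentioned in Finding 2, assuming the same `df` with its renamed `score` column:

```python
import numpy as np

# Nested np.where() equivalent of assign_grade: it works, but the
# grading boundaries are buried inside three levels of nesting
df['grade'] = np.where(df['score'] >= 85, 'A',
              np.where(df['score'] >= 70, 'B',
              np.where(df['score'] >= 60, 'C', 'D')))
```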
This exercise covers the complete pandas workflow that appears in most junior analyst technical tests: load, explore, filter, transform, aggregate, and export. The ability to write a groupby aggregation and a custom apply() function demonstrates intermediate pandas competence, not just beginner-level data loading.
Dataset: GDP (nominal) per Capita.csv
Files: GDP__nominal__per_Capita.csv, GDP_nominal_per_capita_clean.csv
Notebook: Week4.ipynb
Tool: Google Colab, pandas
A global GDP per capita dataset covering approximately 200 countries and territories, with estimates from three sources — IMF, World Bank, and UN — alongside the year each estimate was recorded. Columns: Country/Territory, UN_Region, IMF_Estimate, IMF_Year, WorldBank_Estimate, WorldBank_Year, UN_Estimate, UN_Year.
In my own words: Real datasets are never clean. This one had missing values, inconsistent whitespace in country names, mixed data types, and numeric columns stored as strings. Working through each of these systematically — checking first, then fixing, then verifying — is how a professional analyst prepares data before any analysis or visualisation takes place.
Organisation type: International research organisation, government economics team, financial institution
Any analysis of global economic data needs to account for missing estimates (not all countries report to all three agencies), inconsistent formatting from different data sources, and the difference between a year stored as an integer versus a proper datetime object — which affects how time-based filtering and comparison work downstream.
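As a minimal sketch of that downstream difference (assuming `IMF_Year` has already been converted with `pd.to_datetime` as in the cleaning code below), a datetime column exposes calendar components through the `.dt` accessor:

```python
# With a proper datetime64 column, the .dt accessor gives direct
# access to calendar components for filtering and comparison
recent = df[df['IMF_Year'].dt.year >= 2020]  # estimates from 2020 onward
print(recent.shape)
```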
```python
import pandas as pd

df = pd.read_csv("GDP (nominal) per Capita.csv")

df.head(10)  # First 10 rows
df.tail(5)   # Last 5 rows

# View two key columns
df[['Country/Territory', 'UN_Region']]

# Column info
df.info()

# Summary statistics
df.describe()

# Value counts per region
df['UN_Region'].value_counts()
```

```python
# Total missing values per column
df.isnull().sum()

# Total across the whole DataFrame
df.isnull().sum().sum()

# Rows where all values are missing
rows_all_na = df[df.isna().all(axis=1)]

# Rows where any value is missing
rows_with_na = df[df.isna().any(axis=1)]

# Create a cleaned version removing all rows with any missing value
gdp_clean = df.dropna()

# Verify no missing values remain
gdp_clean.isna().values.any()  # False

# Double-check per column
gdp_clean.isnull().sum()
```

```python
print("Initial Data Types:")
print(df.dtypes)

# a) Strip whitespace from country names
df['Country/Territory'] = df['Country/Territory'].str.strip()

# b) Convert IMF_Year to datetime
df['IMF_Year'] = pd.to_datetime(
    df['IMF_Year'],
    format='%Y',
    errors='coerce'  # invalid entries become NaT (Not a Time)
)
print(df['IMF_Year'].head())
print(df['IMF_Year'].dtype)  # datetime64[ns]

# c) Convert WorldBank_Estimate to integer
# (coerce non-numeric to NaN, fill with 0, then cast to int)
df['WorldBank_Estimate'] = (
    pd.to_numeric(df['WorldBank_Estimate'], errors='coerce')
    .fillna(0)
    .astype(int)
)
print(df['WorldBank_Estimate'].dtype)  # int64
print(df.dtypes)
```

What is NaT?
NaT stands for Not a Time — the datetime equivalent of NaN. It appears when a date value is missing or could not be converted, and behaves like a null in all datetime operations.
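A minimal sketch of that behaviour, using a throwaway Series rather than the GDP data:

```python
import pandas as pd

# "2023" parses cleanly; "unknown" is coerced to NaT instead of
# raising an error
s = pd.to_datetime(pd.Series(["2023", "unknown"]),
                   format="%Y", errors="coerce")
print(s)         # second value shows as NaT
print(s.isna())  # NaT is caught by isna(), exactly like NaN
```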
```python
col = 'IMF_Estimate'

# Calculate quartiles and IQR
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Filter to outlier rows only
outliers = df[(df[col] < lower) | (df[col] > upper)]
print(outliers.shape)  # How many outliers found
outliers.head()        # First few outlier rows (high-GDP countries)
```

```python
# Create cleaned copy with all transformations applied
gdp_clean = df.dropna().copy()
gdp_clean['Country/Territory'] = gdp_clean['Country/Territory'].str.strip()
gdp_clean['IMF_Year'] = pd.to_datetime(
    gdp_clean['IMF_Year'], format='%Y', errors='coerce'
)

# Save to CSV
gdp_clean.to_csv('GDP_nominal_per_capita_clean.csv', index=False)

# Download to device
from google.colab import files
files.download('GDP_nominal_per_capita_clean.csv')
```

Screenshot 1 — df.head(10) showing the GDP dataset
📸 In Google Colab, run `df.head(10)`. Capture the full table showing all eight columns (Country/Territory, UN_Region, IMF_Estimate, IMF_Year, WorldBank_Estimate, WorldBank_Year, UN_Estimate, UN_Year) and the first 10 rows. Monaco should appear as the first country (highest GDP). Save as `images/gdp_head10.png`.

Screenshot 2 — Missing values per column
📸 Run `df.isnull().sum()`. Capture the output showing each column name and its count of missing values. Columns with non-zero counts should be clearly visible. Save as `images/gdp_missing_values.png`.

Screenshot 3 — Data types before and after conversion
📸 Run the `print("Initial Data Types:")` block first, then after all three type conversions run `print(df.dtypes)` again. Capture both outputs side-by-side or in sequence to show how IMF_Year changed from object/int to datetime64 and WorldBank_Estimate changed to int64. Save as `images/gdp_dtypes_change.png`.

Screenshot 4 — IQR outlier detection result
📸 Run the IQR outlier block. Capture the output of `outliers.head()` showing the rows flagged as outliers — these should be high-GDP countries like Monaco, Liechtenstein, or Luxembourg. The Q1, Q3, IQR, lower, and upper bounds should be visible if printed, or the table alone is sufficient. Save as `images/gdp_outliers_iqr.png`.
- Finding 1: The `errors='coerce'` parameter in `pd.to_datetime()` and `pd.to_numeric()` is essential for defensive data cleaning — it converts unparseable values to `NaT` or `NaN` rather than crashing the notebook, which is the correct behaviour in production code.
- Finding 2: The IQR outlier detection flags high-GDP countries like Monaco and Luxembourg as statistical outliers. These aren't data errors — they are genuinely extreme values. A responsible analyst documents this distinction: outliers require investigation, not automatic removal (a "flag, don't drop" sketch follows this list).
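A minimal sketch of the "flag, don't drop" approach from Finding 2, reusing the `lower` and `upper` bounds from the IQR block above (`is_outlier` is a hypothetical column name of my own):

```python
# Add a boolean flag column instead of dropping the outlier rows,
# so extreme-but-genuine values stay available for review
df['is_outlier'] = (df['IMF_Estimate'] < lower) | (df['IMF_Estimate'] > upper)
print(df['is_outlier'].sum(), "rows flagged for investigation")
```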
This task demonstrates a complete, professional data cleaning workflow — the exact steps an analyst would follow before handing data to a reporting or modelling pipeline: check, clean, retype, validate, detect anomalies, and export. The use of .copy() to avoid modifying the original DataFrame is also good practice that distinguishes careful from careless code.
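To illustrate the `.copy()` point, a short sketch with `reviewed` as a hypothetical flag column:

```python
# Without .copy(), this filtered selection may be a view of df;
# assigning to it can raise SettingWithCopyWarning and behave
# unpredictably. With .copy(), the subset is fully independent.
europe = df[df['UN_Region'] == 'Europe'].copy()
europe['reviewed'] = True  # safe: the original df is untouched
```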
Dataset: GDP (nominal) per Capita.csv / GDP_nominal_per_capita_clean.csv
Notebook: W6D4_Task_for_workbook__1_.ipynb
Tool: Google Colab, Matplotlib, Seaborn
Extended the GDP analysis into visualisation — creating scatter plots, histograms, a correlation heatmap, and a top-10 bar chart. The task introduced both Matplotlib and Seaborn, demonstrating how each library's strengths suit different visualisation types.
In my own words: Visualisation is where analysis becomes communication. A table of IMF estimates is hard to read; a scatter plot showing the relationship between WorldBank and UN estimates makes the pattern immediately obvious. Choosing the right chart type for the data — and knowing which library to use — is as important as the underlying numbers.
Organisation type: International economics research team, central bank, global consultancy
Analysts at organisations like the IMF or World Bank present findings to policymakers who need visual summaries of global trends. A correlation heatmap immediately shows which estimates align closely between sources, and a bar chart of top-10 economies gives context to any single-country discussion.
```python
import matplotlib.pyplot as plt

# Full histogram of all numeric columns
df.hist(figsize=(15, 15))
plt.show()

# Focused histogram for IMF Estimate only
df.hist(column='IMF_Estimate', figsize=(6, 4))
plt.show()
```

```python
import numpy as np

plt.figure(figsize=(10, 6))

# Scatter points
plt.scatter(df['WorldBank_Estimate'], df['UN_Estimate'],
            color='blue', alpha=0.6)

# Trend line (linear regression)
# np.polyfit cannot handle NaN, so fit only on rows where both
# estimates are present
pair = df[['WorldBank_Estimate', 'UN_Estimate']].dropna()
m, b = np.polyfit(pair['WorldBank_Estimate'], pair['UN_Estimate'], 1)
x_line = range(int(df['WorldBank_Estimate'].min()),
               int(df['WorldBank_Estimate'].max()))
plt.plot(x_line, [m * x + b for x in x_line], color='red', linewidth=2)

plt.title('UN Estimates vs World Bank Estimates')
plt.xlabel('World Bank Estimate')
plt.ylabel('UN Estimate')
plt.grid(True)
plt.show()
```

```python
import seaborn as sns

# Select only numeric columns
numerical_features = df.select_dtypes(include=['number'])

# Calculate correlation matrix
corr = numerical_features.corr()

# Plot heatmap with annotations
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='turbo')
plt.title('Correlation Matrix — GDP Estimates')
plt.show()
```

```python
gdp_clean.sort_values('IMF_Estimate', ascending=False).head(10).plot(
    kind='bar',
    x='Country/Territory',
    y='IMF_Estimate',
    title='Top 10 Economies by IMF GDP per Capita Estimate',
    figsize=(12, 6),
    color='steelblue'
)
plt.ylabel('IMF GDP per Capita (USD)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
```

Screenshot 1 — Histogram: IMF Estimate Distribution
📸 Run `df.hist(column='IMF_Estimate', figsize=(6, 4))`. Capture the histogram showing the distribution of GDP per capita values. The x-axis should show USD values and the y-axis should show frequency. The right-skew (most countries clustered at low values, with a long tail to the right) should be visible. Save as `images/viz_histogram_imf.png`.

Screenshot 2 — Scatter Plot: WorldBank vs UN Estimates
📸 Run the scatter plot code. Capture the chart showing blue dots (WorldBank_Estimate on x-axis, UN_Estimate on y-axis) with a red trend line. The positive correlation (dots trending upward left to right) should be visible. The axis labels and title should be readable. Save as `images/viz_scatter_wb_un.png`.

Screenshot 3 — Correlation Heatmap
📸 Run the Seaborn heatmap code. Capture the full heatmap showing the correlation matrix of all numeric columns, with colour coding and annotation values visible in each cell. The colour bar legend on the right should be visible. Save as `images/viz_heatmap_correlation.png`.

Screenshot 4 — Top 10 Economies Bar Chart
📸 Run the bar chart code. Capture the full chart showing 10 countries on the x-axis (rotated labels) and IMF GDP per capita on the y-axis. Monaco, Liechtenstein, or Luxembourg should appear as the tallest bar. The title and axis labels should be clearly readable. Save as `images/viz_top10_economies.png`.
- Finding 1: The correlation heatmap shows very high correlation between IMF, World Bank, and UN estimates (values close to 1.0), which confirms the three sources generally agree — but the scatter plot reveals individual countries where estimates diverge significantly, highlighting cases worth investigating.
- Finding 2: The histogram of IMF estimates is strongly right-skewed — the majority of countries have low GDP per capita while a small number of wealthy nations pull the mean far above the median. This is exactly the kind of distribution that makes mean-based comparisons misleading and motivates the use of median or IQR-based analysis (a quick numerical check is sketched below).
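A quick numerical check of that skew, run against the same `df`:

```python
# On a right-skewed column the mean sits well above the median
print(df['IMF_Estimate'].mean())    # pulled upward by wealthy outliers
print(df['IMF_Estimate'].median())  # robust to the long right tail
print(df['IMF_Estimate'].skew())    # positive value confirms right-skew
```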
This task demonstrates I can take cleaned data all the way through to visual output — choosing appropriate chart types for different analytical goals (distribution → histogram, relationship → scatter, correlation → heatmap, ranking → bar chart) and customising them to a professional standard. This full pipeline — load, clean, analyse, visualise — is the core deliverable of a junior data analyst role.
- Google Colab — cloud Python/Jupyter environment; Drive mounting, file upload/download
- pandas — `read_csv`, `head`, `tail`, `info`, `describe`, `isnull`, `dropna`, `groupby`, `pivot_table`, `apply`, `sort_values`, `rename`, `drop`, `to_csv`, `str.strip`, `pd.to_datetime`, `pd.to_numeric`, `quantile`
- NumPy — `np.polyfit` for trend line calculation
- Matplotlib — `plt.scatter`, `plt.plot`, `plt.hist`, `df.plot(kind='bar')`, `plt.grid`, `plt.title`, `plt.xlabel`, `plt.ylabel`
- Seaborn — `sns.heatmap`, correlation matrix visualisation
- Core Python — lists, tuples, sets, dictionaries, `if`/`elif`/`else`, `for` loops, `while` loops, `range()`, modulo `%`, shortcut operators, custom functions with `def`
| File | Description | Source |
|---|---|---|
| `student.csv` | Student names, class, gender, and marks | Bootcamp |
| `GDP__nominal__per_Capita.csv` | GDP per capita estimates (IMF, World Bank, UN) for ~200 countries | Bootcamp (Kaggle/Wikipedia) |
| `GDP_nominal_per_capita_clean.csv` | Cleaned version — NaN rows removed, types fixed | Generated in Task 1 |
| `Week4.ipynb` | Google Colab notebook — GDP cleaning and visualisation | Bootcamp Day 4 Task 1 |
| `W6D4_Task_for_workbook__1_.ipynb` | Google Colab notebook — Day 4 Task 2 visualisation | Bootcamp Day 4 Task 2 |



















