
# Section 7, Question 2: What's The Best Name To Guess Age?

In this Colab, you'll apply the concept of entropy to solve a problem about decision making under uncertainty. Namely (pun intended), you'll write code to decide which name gives away the most information about a person's age.

To start, run the hidden cell below:

## Part 1: Download Data


The cell below downloads the dataset, then processes it into a format our code will use more easily later.

(It takes a minute or so to run.)

In [1]:
# run me! (don't edit)

# these lines are bash (shell) commands, not Python
! wget -q http://web.stanford.edu/class/cs109/section/5/babynames_small.zip  # download from this URL
! unzip -q babynames_small.zip  # unzip the zip file, so we get a folder of data files

import csv
from collections import defaultdict

def load_data_from_file():
    count_map = defaultdict(int) # will be dict from (name, year) to count
    all_years = set() # will be a set of all years in the dataset (sorted into a list below)
    all_names = set() # will be a set of all names in the dataset

    # we will count all instances of each name, and keep only common-ish names
    names_to_total_counts = defaultdict(int)

    # csv.DictReader yields each row as a dictionary keyed by the header row;
    # the with-block closes the file once we are done reading it
    with open('babynames_small/data/baby_names.csv', newline='') as f:
        # loop over all the rows in the CSV file
        for row in csv.DictReader(f):
            # row looks like a dictionary, with keys Year, Name, Count
            year = int(row['Year'])
            name = row['Name']
            num_babies = int(row['Count'])

            # social security applications before 1914 are very biased,
            # so let's skip those
            if year < 1914:
                continue

            all_years.add(year)
            all_names.add(name)
            count_map[(name, year)] += num_babies

            names_to_total_counts[name] += num_babies

    # sort years from lowest to highest
    all_years = sorted(all_years)

    # keep only names given to more than 10,000 babies across the years we kept
    all_names = [name for name in all_names if names_to_total_counts[name] > 10000]

    return all_names, all_years, count_map


all_names, all_years, count_map = load_data_from_file()
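
To make the data format concrete, here is a minimal sketch of the same bookkeeping on a few made-up rows. The rows and the names `toy_csv` and `toy_counts` are illustrative assumptions, not values from the real dataset; only the column names `Year`, `Name`, and `Count` match what the loader above reads.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the same Year/Name/Count layout (not real data)
toy_csv = io.StringIO(
    "Year,Name,Count\n"
    "1950,June,30\n"
    "1950,Chris,10\n"
    "2000,June,5\n"
    "2000,June,5\n"
)

toy_counts = defaultdict(int)  # maps (name, year) -> number of babies
for row in csv.DictReader(toy_csv):
    # repeated (name, year) rows are added together
    toy_counts[(row["Name"], int(row["Year"]))] += int(row["Count"])

print(dict(toy_counts))
# {('June', 1950): 30, ('Chris', 1950): 10, ('June', 2000): 10}
```

Keying the dictionary by a `(name, year)` tuple lets later code look up any count directly, without scanning the file again.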

## Part 2: Inference

The function `inference_P_age_given_name(name, years_list)` returns the distribution $P(\text{Age} = a | \text{Name} = n)$ as a dictionary: each key is an age $a$ (one per year in `years_list`), and each value is the probability that a person named $n$ is that age. We have written the inference code for you here, but for practice you could try writing it yourself.
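
As a minimal sketch of what this computes, the snippet below normalizes made-up yearly counts for a single name into a distribution over ages. The counts and the names `toy_year_counts` and `reference_year` are assumptions for illustration; the real function reads its counts from `count_map` and sets its own reference year.

```python
# Hypothetical counts of babies with one name, by birth year (not real data)
toy_year_counts = {1950: 30, 1975: 10, 2000: 10}
reference_year = 2025  # assumed "current" year used to turn birth years into ages

total = sum(toy_year_counts.values())

# P(Age = a | Name = n) = count(n, year) / total count for n,
# where a = reference_year - year
toy_distribution = {
    reference_year - year: count / total
    for year, count in toy_year_counts.items()
}

print(toy_distribution)  # {75: 0.6, 50: 0.2, 25: 0.2}
```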

Run the cell below so you can use these functions later!

In [2]:
# run me (don't edit); your code later will use inference_P_age_given_name()

def get_count(name, year, data = count_map):
	"""
	Returns the number of babies given this name in the given year.
	Returns 0 if the (name, year) pair is not in the dataset.
	"""
	# .get() looks up the key without inserting it into the defaultdict
	return data.get((name, year), 0)


def inference_P_age_given_name(name, years_list = all_years):
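	'''
	Return a dictionary, where the keys are different ages, and the values
	are the probability that someone is that age, given their name.
	Ages are measured relative to CURRENT_YEAR. Returns an empty dictionary
	if the name never appears in years_list.

	(we wrote this function in section 5!)
	'''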

	CURRENT_YEAR = 2025

	distribution = {}

	total_count = 0
	for year in years_list:
		total_count += get_count(name, year)

	if total_count == 0:
		return {}

	for year in years_list:
		prob_year = get_count(name, year) / total_count
		distribution[CURRENT_YEAR - year] = prob_year

	return distribution

## Part 3: Entropy

**Our Goal:** find the name that leaks the most information about a person's age. Put another way, find the name that results in the lowest possible uncertainty (or entropy) about a person's age.

First, write a function that calculates entropy for a distribution.
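
For reference, the entropy of a discrete distribution, measured in bits, is $H = -\sum_a P(a) \log_2 P(a)$, where outcomes with zero probability contribute nothing to the sum. The sketch below evaluates it on two toy distributions. The name `entropy_bits` and the example PMFs are assumptions for illustration, and base 2 is one common choice of logarithm rather than something this workbook fixes.

```python
import math

def entropy_bits(pmf):
    """Entropy in bits of a PMF given as {outcome: probability}."""
    # Skip zero-probability outcomes: log2(0) is undefined, and they add 0 to the sum
    return sum(-p * math.log2(p) for p in pmf.values() if p > 0)

uniform = {a: 0.25 for a in range(4)}  # four equally likely ages
peaked = {0: 1.0, 1: 0.0}              # the age is known for certain

print(entropy_bits(uniform))  # 2.0 bits: maximal uncertainty over four outcomes
print(entropy_bits(peaked))   # zero bits: no uncertainty at all
```

The more spread out the distribution, the higher the entropy; a distribution concentrated on a few ages scores low.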

Then, apply this function and `inference_P_age_given_name` to find the name $n$ that minimizes the entropy of the distribution $P(\text{Age} = a | \text{Name} = n)$.

Recall: we have the lists `all_names` (every name given to more than 10,000 babies since 1914) and `all_years` (every year in our dataset from 1914 onward, sorted from lowest to highest).
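
One way to pick the lowest-entropy name is Python's built-in `min` with a `key` function. The sketch below is self-contained and runs on made-up distributions. `toy_pmfs`, `entropy_bits`, and `lowest_entropy_name` are hypothetical names, and in this workbook the distributions would come from `inference_P_age_given_name` rather than a hard-coded dictionary.

```python
import math

def entropy_bits(pmf):
    """Entropy in bits of a PMF given as {outcome: probability}."""
    return sum(-p * math.log2(p) for p in pmf.values() if p > 0)

# Hypothetical P(Age | Name) tables (not real data)
toy_pmfs = {
    "Spread": {20: 0.25, 40: 0.25, 60: 0.25, 80: 0.25},  # 2 bits
    "Narrow": {30: 0.5, 31: 0.5},                        # 1 bit
}

def lowest_entropy_name(pmfs_by_name):
    # An empty distribution (no data for the name) would score 0 bits
    # and win by accident, so leave it out of the comparison
    candidates = {n: pmf for n, pmf in pmfs_by_name.items() if pmf}
    return min(candidates, key=lambda n: entropy_bits(candidates[n]))

print(lowest_entropy_name(toy_pmfs))  # Narrow
```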

In [3]:
import math
import numpy as np


def calc_entropy(pmf_dict):
    entropy = 0
    # TODO: your code here
    return entropy


def find_best_name(all_names = all_names, all_years = all_years):
    # TODO: your code here
    return "Chris"  # placeholder; replace with the name your code finds

find_best_name()

'Chris'

What name does your code output? Does that result make sense? You can check the distribution of ages for people with that name using the code below.

In [4]:
# If you want to check the PMF for any name, to verify your answer above:

import pandas as pd
import plotly.express as px

def plot_PMF(names, distributions):
    names = [name.strip() for name in names]

    if len(names) == 1:
        df = pd.DataFrame(list(distributions[0].items()), columns=['Age', 'Probability'])
        fig = px.line(df, x="Age", y="Probability")

        title = fr"$\Huge P(\text{{Age}} = a | \text{{Name = {names[0]}}})$"
        fig.update_layout(
            title = dict(text=title, x=0.5, xanchor='center'),
            xaxis = dict(title=dict(text="Age", font=dict(size=20))),
            yaxis = dict(title=dict(text="Probability", font=dict(size=20)))
        )
        fig.show()

    else:
        df_list = []
        for name, dist in zip(names, distributions):
            if len(dist) == 0:  # Skip empty distributions
                continue
            df_one_name = pd.DataFrame(list(dist.items()), columns=['Age', 'Probability'])
            df_one_name['Name'] = name
            df_list.append(df_one_name)

        df = pd.concat(df_list)

        fig = px.line(df, x="Age", y="Probability", color="Name")

        title=fr"$\Huge P(\text{{Age}} = a | \text{{Name}} = n)$"
        fig.update_layout(
            title = dict(text=title, x=0.5, xanchor='center'),
            xaxis = dict(title=dict(text="Age", font=dict(size=20))),
            yaxis = dict(title=dict(text="Probability", font=dict(size=20))),
            legend = dict(title=dict(font=dict(size=20)), font=dict(size=14))
        )

        fig.show()

# Names whose age distributions P(Age | Name) you want to plot
NAMES = ['June']

plot_PMF(NAMES, [inference_P_age_given_name(name) for name in NAMES])
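
Alongside the plot, a quick numeric check can confirm that a distribution is well formed. This is a minimal sketch on a made-up PMF; `toy_pmf` and `summarize_pmf` are hypothetical names, and on real data you would pass in the dictionary returned by `inference_P_age_given_name`.

```python
import math

def summarize_pmf(pmf):
    """Return (total probability, most likely age, expected age) for {age: probability}."""
    total = sum(pmf.values())
    most_likely_age = max(pmf, key=pmf.get)
    expected_age = sum(age * p for age, p in pmf.items())
    return total, most_likely_age, expected_age

toy_pmf = {25: 0.25, 50: 0.25, 75: 0.5}  # hypothetical distribution (not real data)
total, mode_age, mean_age = summarize_pmf(toy_pmf)

assert math.isclose(total, 1.0)  # probabilities should sum to 1
print(mode_age, mean_age)  # 75 56.25
```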