# Introduction to Data Science 2025

# Week 2

## Exercise 1 | Titanic: data preprocessing and imputation
<span style="font-weight: bold"> *Note: You can find tutorials for NumPy and Pandas under 'Useful tutorials' in the course material.*</span>

Download the [Titanic dataset](https://www.kaggle.com/c/titanic) [train.csv] from Kaggle or <span style="font-weight: 500">directly from the course material</span>, and complete the following exercises. If you choose to download the dataset from Kaggle, you will need to create a Kaggle account unless you already have one, but it is quite straightforward.

The dataset consists of personal information of all the passengers on board the RMS Titanic, along with information about whether they survived the iceberg collision or not.

1. Your first task is to read the data file and print the shape of the data.

    <span style="font-weight: 500"> *Hint 1: You can read it into a Pandas dataframe if you wish.*</span>
    
    <span style="font-weight: 500"> *Hint 2: The shape of the data should be (891, 12).*</span>

In [40]:
import pandas as pd

# Use the full path to your CSV file
train = pd.read_csv(r"C:\Users\HP LAPTOP 15S-FQ2023\titanic_data\train.csv")
# Keep an untouched copy of the raw data
train_original = train.copy()

# Show the first 5 rows to confirm it loaded correctly
print(train.head())
print(train.shape)

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
(8

2. Let's look at the data and get started with some preprocessing. Some of the columns, e.g., <span style="font-weight: 500"> *Name*</span>, simply identify a person and are not useful for prediction tasks. Try to identify these columns and remove them.

    <span style="font-weight: 500"> *Hint: The shape of the data should now be (891, 9).*</span>

In [41]:
train=train.drop(["PassengerId", "Name", "Ticket"], axis=1)
print(train)
print(train.shape)

     Survived  Pclass     Sex   Age  SibSp  Parch     Fare Cabin Embarked
0           0       3    male  22.0      1      0   7.2500   NaN        S
1           1       1  female  38.0      1      0  71.2833   C85        C
2           1       3  female  26.0      0      0   7.9250   NaN        S
3           1       1  female  35.0      1      0  53.1000  C123        S
4           0       3    male  35.0      0      0   8.0500   NaN        S
..        ...     ...     ...   ...    ...    ...      ...   ...      ...
886         0       2    male  27.0      0      0  13.0000   NaN        S
887         1       1  female  19.0      0      0  30.0000   B42        S
888         0       3  female   NaN      1      2  23.4500   NaN        S
889         1       1    male  26.0      0      0  30.0000  C148        C
890         0       3    male  32.0      0      0   7.7500   NaN        Q

[891 rows x 9 columns]
(891, 9)


3. The column <span style="font-weight: 500">*Cabin*</span> contains a letter and a number. A smart catch at this point would be to notice that the letter stands for the deck level on the ship. Keeping just the deck information would be more informative when developing, e.g. a classifier that predicts whether a passenger survived. The next step in our preprocessing will be to add a new column to the dataset, which consists simply of the deck letter. You can then remove the original <span style="font-weight: 500">*Cabin*</span>-column.

<span style="font-weight: 500">*Hint: The deck letters should be ['A' 'B' 'C' 'D' 'E' 'F' 'G' 'T'].*</span>

In [42]:
train["Deck"]=train["Cabin"].str[0]
train=train.drop("Cabin", axis=1)
print(train["Deck"].unique())
print(train.shape)


[nan 'C' 'E' 'G' 'D' 'A' 'B' 'F' 'T']
(891, 9)


4. You’ll notice that some of the columns, such as the previously added deck letter, are [categorical](https://en.wikipedia.org/wiki/Categorical_variable). To make the categorical variables ready for further computation, we need to replace their current string values with numeric ones: each distinct category is mapped to a unique integer ID. This process is called label encoding and you can read more about it [here](https://pandas.pydata.org/docs/user_guide/categorical.html).

    <span style="font-weight: 500">*Hint: Pandas can do this for you.*</span>
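As a minimal sketch of label encoding with Pandas, the snippet below encodes a made-up column of deck letters. The variable names (`deck`, `codes`, `codes_with_nan`) are illustrative only. It also shows that `.cat.codes` represents missing values as `-1`, which matters for the imputation step that follows.

```python
import numpy as np
import pandas as pd

# Toy column standing in for a categorical feature such as the deck letter
deck = pd.Series(["C", np.nan, "E", "C", np.nan], name="Deck")

# .cat.codes assigns integer IDs in sorted category order and uses -1 for missing values
codes = deck.astype("category").cat.codes

# Optional: turn the -1 sentinel back into NaN so missing values stay detectable
codes_with_nan = codes.where(codes >= 0)

print(codes.tolist())
print(codes_with_nan.isna().sum())
```

Whether to keep the `-1` sentinel or restore `NaN` is a design choice; restoring `NaN` keeps `isnull()` and `fillna()` usable after encoding.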

In [None]:
for col in ["Sex", "Embarked", "Deck"]:
    # Categories are numbered in sorted order; missing values are encoded as -1
    train[col]=train[col].astype("category").cat.codes

print(train.head())
# Sex: 0 = female, 1 = male


   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked  Deck
0         0       3    1  22.0      1      0   7.2500         2    -1
1         1       1    0  38.0      1      0  71.2833         0     2
2         1       3    0  26.0      0      0   7.9250         2    -1
3         1       1    0  35.0      1      0  53.1000         2     2
4         0       3    1  35.0      0      0   8.0500         2    -1


5. Next, let's look into missing value **imputation**. Some of the rows in the data have missing values, e.g., when the cabin number of a person is unknown. Most machine learning algorithms have trouble with missing values, so they need to be handled during preprocessing:

    a) For continuous variables, replace the missing values with the mean of the non-missing values of that column.

    b) For categorical variables, replace the missing values with the mode of the column.

    <span style="font-weight: 500">*Remember: Even though in the previous step we transformed categorical variables into their numeric representation, they are still categorical.*</span>
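The two rules above can be sketched on a small made-up frame. The column values here are invented for illustration; the point is only the pattern of assigning the result of `fillna` back to the column, using the mean for a continuous variable and the first mode for a categorical one.

```python
import numpy as np
import pandas as pd

# Toy data: "Age" is continuous, "Embarked" is categorical
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 36.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

# a) continuous: fill with the mean of the observed values
df["Age"] = df["Age"].fillna(df["Age"].mean())

# b) categorical: fill with the most frequent value;
#    mode() returns a Series, so take its first entry
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df)
```

Note that this only works while missing values are still `NaN`; if a label encoding has already turned them into a sentinel such as `-1`, `fillna` has nothing to fill.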

In [48]:
# First, identify the columns that have missing data
print(train.isnull().sum())

# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection; the chained in-place form stops working in pandas 3.0
train["Age"] = train["Age"].fillna(train["Age"].mean())
# Embarked and Deck were label-encoded with .cat.codes, which stores missing
# values as -1 rather than NaN, so these two calls find nothing to fill
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])
train["Deck"] = train["Deck"].fillna(train["Deck"].mode()[0])
print(train.isnull().sum())
print(train.head())


Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
Deck        0
dtype: int64
Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
Deck        0
dtype: int64
   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked  Deck
0         0       3    1  22.0      1      0   7.2500         2    -1
1         1       1    0  38.0      1      0  71.2833         0     2
2         1       3    0  26.0      0      0   7.9250         2    -1
3         1       1    0  35.0      1      0  53.1000         2     2
4         0       3    1  35.0      0      0   8.0500         2    -1

6. At this point, all data is numeric. Write the data, with the modifications we made, to a  <span style="font-weight: 500"> .csv</span> file. Then, write another file, this time in <span style="font-weight: 500">JSON</span> format, with the following structure:

In [6]:
#[
#    {
#        "Deck": 0,
#        "Age": 20,
#        "Survived": 0,
#        ...
#    },
#    {
#        ...
#    }
#]
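The target structure above is a JSON array with one object per row. As a hedged sketch, the snippet below writes a made-up frame with `orient="records"` to a temporary file and reads it back with the standard `json` module to confirm the shape. The file name `example.json` and the toy values are assumptions for illustration.

```python
import json
import os
import tempfile

import pandas as pd

# Toy frame with the same kind of columns as the cleaned data
df = pd.DataFrame({"Deck": [0, 2], "Age": [20.0, 38.0], "Survived": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.json")  # hypothetical file name
    # orient="records" writes a JSON array with one object per row
    df.to_json(path, orient="records")
    with open(path) as f:
        records = json.load(f)

print(records)
```

Reading the file back is a cheap way to check that the output is a list of dictionaries rather than a dictionary keyed by column or index.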

In [51]:
# index=False keeps the row numbers out of the file
train.to_csv(r"C:\Users\HP LAPTOP 15S-FQ2023\titanic_data\train_cleaned.csv", index=False)

# JSON orientations:
#   "records" -> list of dictionaries, one per row (the structure asked for here)
#   "index"   -> dictionary of dictionaries keyed by index
#   "columns" -> dictionary of dictionaries keyed by column name
# Adding lines=True would instead write one JSON object per line
train.to_json(r"C:\Users\HP LAPTOP 15S-FQ2023\titanic_data\train_cleaned.json",
              orient="records")

# First rows of the CSV file
# Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Deck
# 0,3,1,22.0,1,0,7.25,2,-1
# 1,1,0,38.0,1,0,71.2833,0,2
# 1,3,0,26.0,0,0,7.925,2,-1

# JSON output with lines=True: not a list of dictionaries, each row is a separate JSON object on its own line
# {"Survived":0,"Pclass":3,"Sex":1,"Age":22.0,"SibSp":1,"Parch":0,"Fare":7.25,"Embarked":2,"Deck":-1}
# {"Survived":1,"Pclass":1,"Sex":0,"Age":38.0,"SibSp":1,"Parch":0,"Fare":71.2833,"Embarked":0,"Deck":2}
# {"Survived":1,"Pclass":3,"Sex":0,"Age":26.0,"SibSp":0,"Parch":0,"Fare":7.925,"Embarked":2,"Deck":-1}

# JSON output without lines=True: a proper JSON list of dictionaries that json.load() can read directly
# (shown on separate lines here for readability)
# [
#   {"Survived":0,"Pclass":3,"Sex":1,"Age":22.0,"SibSp":1,"Parch":0,"Fare":7.25,"Embarked":2,"Deck":-1},
#   {"Survived":1,"Pclass":1,"Sex":0,"Age":38.0,"SibSp":1,"Parch":0,"Fare":71.2833,"Embarked":0,"Deck":2},
#   {"Survived":1,"Pclass":3,"Sex":0,"Age":26.0,"SibSp":0,"Parch":0,"Fare":7.925,"Embarked":2,"Deck":-1}
# ]



Study the records and try to see if there is any evident pattern in terms of chances of survival.

**Remember to submit your code on the MOOC platform. You can return this Jupyter notebook (.ipynb) or .py, .R, etc depending on your programming preferences.**

## Exercise 2 | Titanic 2.0: exploratory data analysis

In this exercise, we’ll continue to study the Titanic dataset from the last exercise. Now that we have done some preprocessing, it’s time to look at the data with some exploratory data analysis.

1. First investigate each feature variable in turn. For each categorical variable, find out the mode, i.e., the most frequent value. For numerical variables, calculate the median value.

In [56]:
survived_mode=train["Survived"].mode()[0]
print("Most frequent Survived:", survived_mode)
pclass_mode=train["Pclass"].mode()[0]
print("Most frequent Pclass:", pclass_mode)
sex_mode=train["Sex"].mode()[0]
print("Most frequent Sex:", sex_mode)
age_median=train["Age"].median()
print("Median Age:", age_median)
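# SibSp counts siblings/spouses aboard and Parch counts parents/children aboard;
# the values printed for them are medians
sibsp_median=train["SibSp"].median()
print("Number of siblings aboard:", sibsp_median)

parch_median=train["Parch"].median()
print("Number of Parents aboard:", parch_median)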

fare_median=train["Fare"].median()
print("Median Fare:", fare_median)

embarked_mode=train["Embarked"].mode()[0]
print("Most frequent Embarked:", embarked_mode)

deck_mode=train["Deck"].mode()[0]
print("Most frequent Deck:", deck_mode)




Most frequent Survived: 0
Most frequent Pclass: 3
Most frequent Sex: 1
Median Age: 29.69911764705882
Number of siblings aboard: 0.0
Number of Parents aboard: 0.0
Median Fare: 14.4542
Most frequent Embarked: 2
Most frequent Deck: -1


2. Next, combine the modes of the categorical variables, and the medians of the numerical variables, to construct an imaginary “average survivor”. This "average survivor" should represent the typical passenger of the class of passengers who survived. Also following the same principle, construct the “average non-survivor”.

    <span style="font-weight: 500">*Hint 1: What are the average/most frequent variable values for a non-survivor?*</span>
    
    <span style="font-weight: 500">*Hint 2: You can split the dataframe in two: one subset containing all the survivors and one consisting of all the non-survivor instances. Then, you can use the summary statistics of each of these dataframe to create a prototype "average survivor" and "average non-survivor", respectively.*</span>
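As an alternative to splitting the dataframe by hand, the same prototypes can be sketched with a single `groupby`. This is only one possible approach, shown on made-up data; `first_mode` and `prototypes` are hypothetical names, and ties in the mode are resolved by taking the smallest value.

```python
import pandas as pd

# Toy data; column names mirror the exercise, values are made up
df = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0, 0],
    "Pclass":   [1, 1, 2, 3, 3, 2],
    "Fare":     [80.0, 50.0, 20.0, 8.0, 7.0, 12.0],
})

def first_mode(s):
    # mode() returns all tied values in sorted order; keep the first one
    return s.mode().iloc[0]

# One row per survival class: mode for the categorical column, median for the numeric one
prototypes = df.groupby("Survived").agg(
    Pclass=("Pclass", first_mode),
    Fare=("Fare", "median"),
)
print(prototypes)
```

The result has one row per value of `Survived`, so the "average survivor" is `prototypes.loc[1]` and the "average non-survivor" is `prototypes.loc[0]`.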

In [61]:
#split the dataset

survivors=train[train["Survived"]==1]
#selecting rows, creating the condition(Boolean series)
non_survivors=train[train["Survived"]==0]

numerical_cols = ["Age", "SibSp", "Parch", "Fare"]  # adjust based on your dataset
survivor_medians = survivors[numerical_cols].median()



categorical_cols=["Pclass", "Sex", "Embarked", "Deck"]
survivor_modes=survivors[categorical_cols].mode().iloc[0]

#non_survivors
non_survivor_medians = non_survivors[numerical_cols].median()
non_survivor_modes = non_survivors[categorical_cols].mode().iloc[0]

average_survivor = pd.concat([survivor_modes, survivor_medians])
#combining 2 series of modes and medians
#result is a pandas series
print("Average Survivor:\n", average_survivor)

average_non_survivor=pd.concat([non_survivor_modes, non_survivor_medians])
print("Average non_survivor:\n", average_non_survivor)

average_passengers = pd.DataFrame([average_survivor, average_non_survivor], index=["Survivor", "Non-Survivor"])
print(average_passengers)

Average Survivor:
 Pclass       1.000000
Sex          0.000000
Embarked     2.000000
Deck        -1.000000
Age         29.699118
SibSp        0.000000
Parch        0.000000
Fare        26.000000
dtype: float64
Average non_survivor:
 Pclass       3.000000
Sex          1.000000
Embarked     2.000000
Deck        -1.000000
Age         29.699118
SibSp        0.000000
Parch        0.000000
Fare        10.500000
dtype: float64
              Pclass  Sex  Embarked  Deck        Age  SibSp  Parch  Fare
Survivor         1.0  0.0       2.0  -1.0  29.699118    0.0    0.0  26.0
Non-Survivor     3.0  1.0       2.0  -1.0  29.699118    0.0    0.0  10.5


3. Next, let's study the distributions of the variables in the two groups (survivor/non-survivor). How well do the average cases represent the respective groups? Can you find actual passengers that are very similar to the (average) representative of their own group? Can you find passengers that are very similar to the (average) representative of the other group?

    <span style="font-weight: 500">*Note: Feel free to choose EDA methods according to your preference: non-graphical/graphical, static/interactive - anything goes.*</span>
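One way to make "very similar to the average representative" concrete is a distance between each passenger and the group prototype. The sketch below uses made-up values and divides each column by its standard deviation before taking the Euclidean distance, so that a wide-ranging feature such as the fare does not dominate. The scaling step is an assumption of this sketch, not a requirement of the exercise.

```python
import numpy as np
import pandas as pd

# Toy group of passengers; values are made up
group = pd.DataFrame({
    "Age":  [22.0, 28.0, 30.0, 38.0, 60.0],
    "Fare": [7.0, 9.0, 10.0, 13.0, 200.0],
})

prototype = group.median()  # per-column median, as in the "average" passenger

# Scale each column's difference by that column's standard deviation
scaled_diff = (group - prototype) / group.std()
distance = np.sqrt((scaled_diff ** 2).sum(axis=1))

closest = distance.idxmin()   # passenger most similar to the prototype
farthest = distance.idxmax()  # passenger least similar to the prototype
print(closest, farthest)
```

Looking at the largest distances as well as the smallest gives a rough sense of how well a single prototype represents the whole group.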

In [80]:
# Convert categorical columns first
for col in ["Sex", "Embarked", "Deck"]:
    train[col] = train[col].astype("category")

# Now create subsets
survivors = train[train["Survived"] == 1].copy()
non_survivors = train[train["Survived"] == 0].copy()

# Now describe
#Survivors
print("Survivors summary:")
print(survivors.describe(include="all"))

#Non-survivors
print("\nNon-survivors summary:")
print(non_survivors.describe(include="all"))

#Counts how many times instance from each category appears
for col in ["Sex", "Embarked", "Deck"]:
    print(f"Survivors - {col} distribution:\n", survivors[col].value_counts())
    print(f"Non-survivors - {col} distribution:\n", non_survivors[col].value_counts())

# Optional visual check, left commented out: KDE plots for the numeric
# columns and normalized value counts for the categorical ones
# import matplotlib.pyplot as plt

# for col in ["Age", "Fare", "SibSp", "Parch"]:
#     plt.figure()
#     train.groupby("Survived")[col].plot(kind="kde", legend=True)
#     plt.title(f"Distribution of {col} by survival")
#     plt.show()

# for col in ["Sex", "Embarked", "Pclass", "Deck"]:
#     print(f"\n{col} distribution by survival:")
#     print(train.groupby("Survived")[col].value_counts(normalize=True))

import numpy as np

numerical_cols=["Age", "Fare", "SibSp", "Parch"]
categorical_cols=["Sex", "Embarked", "Pclass", "Deck"]

# Compute distances for survivors

survivors["distance"]=np.linalg.norm(
    survivors[numerical_cols]-average_survivor[numerical_cols], axis=1)

# Explanation:
# survivors[numerical_cols] is a DataFrame with only the numeric columns
# (Age, Fare, SibSp, Parch) for all survivors, e.g. of shape (342, 4).
# average_survivor[numerical_cols] is a Series with the survivor medians of
# those columns (the "average" case built in part 2).
# np.linalg.norm(..., axis=1) computes the Euclidean distance (L2 norm) of each
# row's difference vector; axis=1 means the norm is taken across columns, so
# there is one distance per passenger.
# Euclidean distance is only meaningful for the numeric columns.
# The result is a 1D NumPy array with one entry per survivor, which is stored
# in the new column survivors["distance"].

closest_survivor=survivors.loc[survivors["distance"].idxmin()]
print("Closest passenger to average survivor:\n", closest_survivor)

# survivors["distance"].idxmin(): finds the index (row label) of the passenger with the smallest distance
# (i.e., the passenger numerically most similar to the average survivor).
# .loc[...] → retrieves that exact row from the DataFrame.
# closest_survivor is a single row (Series) representing the real passenger closest to the "average survivor".

non_survivors["distance"]=np.linalg.norm(
    non_survivors[numerical_cols]-average_non_survivor[numerical_cols], axis=1
)

closest_non_survivor=non_survivors.loc[non_survivors["distance"].idxmin()]
print("Closest passenger to non_survivor", closest_non_survivor)







Survivors summary:
        Survived      Pclass    Sex         Age       SibSp       Parch  \
count      342.0  342.000000  342.0  342.000000  342.000000  342.000000   
unique       NaN         NaN    2.0         NaN         NaN         NaN   
top          NaN         NaN    0.0         NaN         NaN         NaN   
freq         NaN         NaN  233.0         NaN         NaN         NaN   
mean         1.0    1.950292    NaN   28.549778    0.473684    0.464912   
std          0.0    0.863321    NaN   13.772498    0.708688    0.771712   
min          1.0    1.000000    NaN    0.420000    0.000000    0.000000   
25%          1.0    1.000000    NaN   21.000000    0.000000    0.000000   
50%          1.0    2.000000    NaN   29.699118    0.000000    0.000000   
75%          1.0    3.000000    NaN   35.000000    1.000000    1.000000   
max          1.0    3.000000    NaN   80.000000    4.000000    5.000000   

              Fare  Embarked   Deck  
count   342.000000     342.0  342.0  
uniq

In [81]:
for col in ["Sex", "Embarked", "Deck"]:
    print(f"Survivors - {col} distribution:\n", survivors[col].value_counts())
    print(f"Non-survivors - {col} distribution:\n", non_survivors[col].value_counts())



Survivors - Sex distribution:
 Sex
0    233
1    109
Name: count, dtype: int64
Non-survivors - Sex distribution:
 Sex
1    468
0     81
Name: count, dtype: int64
Survivors - Embarked distribution:
 Embarked
 2    217
 0     93
 1     30
-1      2
Name: count, dtype: int64
Non-survivors - Embarked distribution:
 Embarked
 2    427
 0     75
 1     47
-1      0
Name: count, dtype: int64
Survivors - Deck distribution:
 Deck
-1    206
 1     35
 2     35
 3     25
 4     24
 5      8
 0      7
 6      2
 7      0
Name: count, dtype: int64
Non-survivors - Deck distribution:
 Deck
-1    481
 2     24
 1     12
 0      8
 3      8
 4      8
 5      5
 6      2
 7      1
Name: count, dtype: int64


In [75]:
import numpy as np

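# Use both the numeric and the label-encoded categorical columns; treating
# category codes as numbers is only a rough similarity heuristic
numeric_and_categorical_cols = numerical_cols + categorical_cols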

# Distance from average survivor
distances_to_survivor_avg = survivors[numeric_and_categorical_cols].apply(
    lambda row: np.linalg.norm(row - average_survivor[numeric_and_categorical_cols]),
    axis=1
)

# Smallest distance = most similar passenger
most_typical_survivor = survivors.loc[distances_to_survivor_avg.idxmin()]
print("Most typical survivor:\n", most_typical_survivor)

distances_to_non_survivor_avg = non_survivors[numeric_and_categorical_cols].apply(
    lambda row: np.linalg.norm(row - average_non_survivor[numeric_and_categorical_cols]),
    axis=1
)

most_typical_non_survivor = non_survivors.loc[distances_to_non_survivor_avg.idxmin()]
print("Most typical non-survivor:\n", most_typical_non_survivor)

Most typical survivor:
 Survived     1.000000
Pclass       1.000000
Sex          1.000000
Age         29.699118
SibSp        0.000000
Parch        0.000000
Fare        26.550000
Embarked     2.000000
Deck        -1.000000
Name: 507, dtype: float64
Most typical non-survivor:
 Survived     0.000000
Pclass       3.000000
Sex          1.000000
Age         29.699118
SibSp        0.000000
Parch        0.000000
Fare         9.500000
Embarked     2.000000
Deck        -1.000000
Name: 868, dtype: float64


In [None]:
# Find the survivor closest to the non-survivor average
dist_survivor_to_non_avg = survivors[numeric_and_categorical_cols].apply(
    lambda row: np.linalg.norm(row - average_non_survivor[numeric_and_categorical_cols]),
    axis=1
)
closest_survivor_to_non = survivors.loc[dist_survivor_to_non_avg.idxmin()]
print("Survivor most similar to non-survivor average:\n", closest_survivor_to_non)


Survivor most similar to non-survivor average:
 Survived     1.0
Pclass       3.0
Sex          1.0
Age         30.0
SibSp        0.0
Parch        0.0
Fare         9.5
Embarked     2.0
Deck        -1.0
Name: 286, dtype: float64


In [82]:
import numpy as np

numerical_cols=["Age", "Fare", "SibSp", "Parch"]
categorical_cols=["Sex", "Embarked", "Pclass", "Deck"]
# Compute distances for survivors
survivors["distance"]=np.linalg.norm(
    survivors[numerical_cols]-average_survivor[numerical_cols], axis=1)

closest_survivor=survivors.loc[survivors["distance"].idxmin()]
print("Closest passenger to average survivor:\n", closest_survivor)

Closest passenger to average survivor:
 Survived     1.000000
Pclass       1.000000
Sex          1.000000
Age         29.699118
SibSp        0.000000
Parch        0.000000
Fare        26.550000
Embarked     2.000000
Deck        -1.000000
distance     0.550000
Name: 507, dtype: float64


In [83]:
non_survivors["distance"]=np.linalg.norm(
    non_survivors[numerical_cols]-average_non_survivor[numerical_cols], axis=1
)

closest_non_survivor=non_survivors.loc[non_survivors["distance"].idxmin()]
print("Closest passenger to non_survivor", closest_non_survivor)

Closest passenger to non_survivor Survived     0.000000
Pclass       2.000000
Sex          1.000000
Age         30.000000
SibSp        0.000000
Parch        0.000000
Fare        10.500000
Embarked     2.000000
Deck        -1.000000
distance     0.300882
Name: 219, dtype: float64


4. Next, let's continue the analysis by looking into pairwise and multivariate relationships between the variables in the two groups. Try to visualize two variables at a time using, e.g., scatter plots and use a different color to encode the survival status.

    <span style="font-weight: 500">*Hint 1: You can also check out Seaborn's pairplot function, if you wish.*</span>

    <span style="font-weight: 500">*Hint 2: To better show many data points with the same value for a given variable, you can use either transparency or ‘jitter’.*</span>

In [11]:
# Use this cell for your code

5. Finally, recall the preprocessing we did in the first exercise. What can you say about the effect of the choices that were made to use the mode and mean to impute missing values, instead of, for example, ignoring passengers with missing data?
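One effect that can be checked directly is what mean imputation does to the spread of a variable. In the outputs above, the median age of both groups is 29.699118, which is what one would expect if many missing ages were all replaced by the same mean value. The sketch below compares dropping rows with mean imputation on a made-up column; the numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy age column with missing entries; values are made up
age = pd.Series([20.0, 30.0, 40.0, np.nan, np.nan, np.nan])

dropped = age.dropna()            # ignore passengers with a missing age
imputed = age.fillna(age.mean())  # replace missing ages with the observed mean

print(dropped.mean(), imputed.mean())  # the mean is unchanged by mean imputation
print(dropped.std(), imputed.std())    # the spread shrinks after imputation
```

Mean imputation keeps every row and preserves the column mean, but it concentrates values at a single point, which understates the variance and can pull the median toward the mean.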

*Use this (markdown) cell for your written answer*

**Remember to submit your code on the MOOC platform. You can return this Jupyter notebook (.ipynb) or .py, .R, etc depending on your programming preferences.**

## Exercise 3 | Working with text data 2.0

This exercise is related to the second exercise from last week. Find the saved <span style="font-weight: 500">pos.txt</span> and <span style="font-weight: 500">neg.txt</span> files, or, alternatively, you can find the week 1 example solutions on the MOOC platform after Tuesday.

1. Find the most common words in each file (positive and negative). Examine the results. Do they tend to be general terms relating to the nature of the data? How well do they indicate positive/negative sentiment?
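A minimal sketch of counting word frequencies with the standard library is shown below. The text is a made-up stand-in for the contents of one of the files, and the tokenization rule (lowercase, letters and apostrophes only) is an assumption that may need adjusting to the actual data.

```python
from collections import Counter
import re

# Stand-in text; in the exercise this would be the contents of pos.txt or neg.txt
text = "great phone great battery the screen is great the price is fair"

# Lowercase the text and keep runs of letters and apostrophes as words
words = re.findall(r"[a-z']+", text.lower())

# most_common(n) returns (word, count) pairs, highest counts first
top = Counter(words).most_common(3)
print(top)
```

Even in this tiny example, general words such as "the" and "is" appear near the top, which is the kind of pattern the question asks you to look for.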

In [12]:
# Use this cell for your code

2. Compute a [TF/IDF](https://en.wikipedia.org/wiki/Tf–idf) vector for each of the two text files, and make them into a <span style="font-weight: 500">2 x m</span> matrix, where <span style="font-weight: 500">m</span> is the number of unique words in the data. The problem with using the most common words in a review to analyze its contents is that words that are common overall will be common in all reviews (both positive and negative), so they are probably not good indicators of the sentiment of a specific review. TF/IDF stands for Term Frequency / Inverse Document Frequency (here the reviews are the documents). It is designed to help by taking into account not just the number of times a term occurs in a document (term frequency), but also in how many of the other documents the term appears (inverse document frequency). You can use any variant of the formula, as well as off-the-shelf implementations. <span style="font-weight: 500">*Hint: You can use [sklearn](http://scikit-learn.org/).*</span>
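To make the idea concrete, here is a hand-rolled sketch of one simple TF/IDF variant: relative term frequency multiplied by `log(N / df)`. The two short word lists are made-up stand-ins for the files, and names such as `doc_freq` and `tfidf_vector` are illustrative. Off-the-shelf implementations often use smoothed variants, so their scores will generally differ from these.

```python
import math
from collections import Counter

# Two stand-in "documents"; in the exercise these would be the pos.txt and neg.txt contents
docs = {
    "pos": "good phone good battery".split(),
    "neg": "bad phone bad screen".split(),
}

vocab = sorted({w for words in docs.values() for w in words})
n_docs = len(docs)

# Document frequency: in how many documents each word appears
doc_freq = {w: sum(w in words for words in docs.values()) for w in vocab}

def tfidf_vector(words):
    counts = Counter(words)
    # tf = relative frequency in the document, idf = log(N / df)
    return [counts[w] / len(words) * math.log(n_docs / doc_freq[w]) for w in vocab]

# Rows are the two documents, columns are the m unique words
matrix = [tfidf_vector(docs["pos"]), tfidf_vector(docs["neg"])]
for w, p, n in zip(vocab, *matrix):
    print(f"{w:8s} pos={p:.3f} neg={n:.3f}")
```

With this unsmoothed formula and only two documents, any word that occurs in both files gets a score of exactly zero, as "phone" does here, while words specific to one file keep a positive score.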

In [13]:
# Use this cell for your code

3. List the words with the highest TF/IDF score in each class (positive | negative), and compare them to the most common words. What do you notice? Did TF/IDF work as expected?

In [14]:
# Use this cell for your code

4. Plot the words in each class with their corresponding TF/IDF scores. Note that there will be a lot of words, so you’ll have to think carefully to make your chart clear! If you can’t plot them all, plot a subset – think about how you should choose this subset.

    <span style="font-weight: 500">*Hint: you can use word clouds. But feel free to challenge yourselves to think of any other meaningful way to visualize this information!*</span>

In [15]:
# Use this cell for your code

**Remember to submit your code on the MOOC platform. You can return this Jupyter notebook (.ipynb) or .py, .R, etc depending on your programming preferences.**

## Exercise 4 | Junk charts

There’s a thriving community of chart enthusiasts who keep looking for statistical graphics that they find inappropriate, and which they call “junk charts”, and who often also propose ways to improve them.

1. Find at least three statistical visualizations you think are not very good and identify their problems. Copying examples from various junk chart websites is not accepted – you should find your own junk charts, out in the wild. You should be able to find good (or rather, bad) examples quite easily since a significant fraction of charts can have at least *some* issues. The examples you choose should also have different problems, e.g., try to avoid collecting three bar charts, all with problematic axes. Instead, try to find as interesting and diverse examples as you can.

2. Try to produce improved versions of the charts you selected. The data is of course often not available, but perhaps you can try to extract it, at least approximately, from the chart. Or perhaps you can simulate data that looks similar enough to make the point.



**Submit a PDF with all the charts (the ones you found and the ones you produced).**