## Class 1 - Formulating a modeling problem
In our first lecture, we discussed a few overarching points related to this course. Let's do a quick recap:
- The core of this course will be devoted to exploring ways in which we can extract knowledge from data;
- This relies on our being able to "ask questions" of our data;
- Most of these questions will revolve around learning a mathematical or algorithmic model of relations between some features and an outcome, or, when no outcome is available, learning "structures" within our feature space;
- We can do so for two (not mutually exclusive) reasons: to be able to infer the outcome from the features we can observe or to understand how and why inputs and outcomes are related;
- Here, we will mostly focus on developing models which are **good at inferring outcomes from features** in new data.


We emphasized that an important skill for a data scientist is being able to identify **questions** that can be answered with data. Let's start getting our hands dirty with this in this first class. Today, the focus will be on formulating an interesting predictive question based on a dataset of your own choice.

### Structure of today's exercise
For this class, your goal is to perform the following tasks:
1. Together with your group, choose one of these datasets (or find a new one)
    - HippoCorpus (a dataset of recalled or imagined stories, paired with a number of story- and participant-related metadata: https://www.kaggle.com/datasets/saurabhshahane/hippocorpus)
    - EEG Psychiatric Disorders Dataset: https://www.kaggle.com/datasets/shashwatwork/eeg-psychiatric-disorders-dataset?resource=download (from this paper: https://www.frontiersin.org/journals/psychiatry/articles/10.3389/fpsyt.2021.707581/full)
    - Personalities and random number choices from OpenPsychometrics: https://openpsychometrics.org/_rawdata/ (search for "random numbers")
    - A large-scale data set containing data from a bike-sharing service and weather information: https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset.

2. Using `pandas`, load the corresponding data, which you will find under `Project Files/data`

3. Using `pandas` and `seaborn`, get a grasp of the overall characteristics of the dataset:
    - What is the size of your dataset, and how many features are available? 
        - Hint: Use `DataFrame.shape` from `pandas`
    - What kind of information do the columns include?
        - Hint: Read the dataset's documentation + associated papers. Methods like `.info()` or `.describe()` could also be useful.
    - What *types* of variables does each of the columns contain? What kind of values do we expect to find in each column?
        - Hint: to extract this information analytically, look into `pandas` `dtype`, `unique`, and `min`/`max` functions
        - To plot this information, use `seaborn` functions (`displot`, `pointplot`, `catplot` or `boxplot` could be helpful)
    - What is the proportion of missing values for each column? Is there any column with a worryingly high proportion of missing values?
        - Hint: use the `.isnull()` method and aggregate over rows using `.sum()`
    - What is the proportion of missing values for each row? Is there any row with a worryingly high proportion of missing values?
        - Hint: very similar to what you did above
    - Is there any very apparent structure in your data, e.g., clusters of highly correlated features? 
        - Hint: use pandas `.corr()` and seaborn `clustermap` to look into that: https://seaborn.pydata.org/generated/seaborn.clustermap.html

4. Think about what information the dataset contains, and formulate one of the following:
    - A prediction question that can be addressed in terms of predictive performance in a regression task;
    - A prediction question that can be addressed in terms of predictive performance in a classification task

5. For the regression OR classification task you have formulated, answer the following questions:
    - What kind of metric can you use to assess whether the model predicts successfully?
    - What is the simplest performance baseline with no predictors you can use to assess your model's accuracy?
    - What is the simplest performance baseline with predictors you can use to assess your model's accuracy?
    - Can you produce some visualizations to get a sense for whether any clear pattern is emerging?
        - Hint: you can use `seaborn` `displot`, `pointplot` or `boxplot` to visualize distributions and their summaries, and `scatterplot` or `lmplot` to produce scatterplots (e.g., visualizing relations between variables)
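
The two baselines asked about in step 5 can be sketched on toy data. This is only an illustration under stated assumptions: the arrays below are invented (they are not taken from any of the datasets above), mean absolute error and accuracy are just one possible choice of metric, and a one-predictor least-squares line is just one possible "simplest baseline with predictors".

```python
import numpy as np

# Invented toy data: one feature x and one continuous outcome y
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([2.1, 3.9, 6.2, 7.8])
x_test = np.array([5.0, 6.0])
y_test = np.array([10.0, 12.0])


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted values."""
    return float(np.mean(np.abs(y_true - y_pred)))


# Regression baseline with no predictors: always predict the training mean
mean_pred = np.full_like(y_test, y_train.mean())
baseline_mae = mae(y_test, mean_pred)

# Regression baseline with one predictor: least-squares line fitted on the training data
slope, intercept = np.polyfit(x_train, y_train, deg=1)
linear_mae = mae(y_test, slope * x_test + intercept)

# Classification baseline with no predictors: always predict the majority class
labels_train = np.array(["a", "a", "b", "a"])
labels_test = np.array(["a", "b", "a"])
values, counts = np.unique(labels_train, return_counts=True)
majority = values[np.argmax(counts)]
majority_accuracy = float(np.mean(labels_test == majority))

print(baseline_mae, linear_mae, majority, majority_accuracy)
```

A model is only interesting if it beats these numbers on data it has not seen, which is why both baselines are fitted on the training portion and scored on the held-out portion.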

In [24]:
import pandas as pd
import numpy as np
import seaborn as sns

In [28]:
# read the data
random_numbers = pd.read_csv("../../../class_01/randomnumber.csv")

random_numbers.head(24)

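| Unnamed: 0 | submittime | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9 | ... | O1 | O2 | O3 | O4 | O5 | O6 | O7 | O8 | O9 | O10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-03-29 10:29:36 | 56 | 786 | 48 | 479 | 23 | 632 | 31 | 38 | 61 | ... | 4 | 4 | 2 | 4 | 5 | 5 | 5 | 3 | 5 | 3 |
| 1 | 2015-03-29 06:24:30 | 5 | 500 | 77 | 102 | 65 | 1223 | 50 | 16 | 14 | ... | 5 | 1 | 3 | 1 | 4 | 2 | 5 | 5 | 5 | 4 |
| 2 | 2015-03-29 02:40:41 | 8 | 18 | 88 | 666 | 28 | 1233 | 99 | 27 | 2 | ... | 4 | 1 | 5 | 1 | 4 | 1 | 4 | 2 | 5 | 4 |
| 3 | 2015-03-28 10:17:47 | 99 | 999 | 98 | 499 | 99 | 1233 | 99 | 49 | 99 | ... | 4 | 1 | 5 | 1 | 5 | 1 | 5 | 4 | 5 | 5 |
| 4 | 2015-03-28 10:15:47 | 40 | 500 | 77 | 200 | 50 | 133 | 70 | 17 | 80 | ... | 3 | 3 | 4 | 2 | 4 | 1 | 4 | 4 | 5 | 3 |
| 5 | 2015-03-28 10:25:14 | 45 | 456 | 77 | 356 | 25 | 567 | 65 | 39 | 65 | ... | 2 | 3 | 5 | 3 | 3 | 2 | 3 | 2 | 2 | 3 |
| 6 | 2015-03-28 10:36:42 | 27 | 500 | 66 | 101 | 3 | 1100 | 12 | 42 | 25 | ... | 4 | 2 | 4 | 1 | 3 | 2 | 3 | 4 | 5 | 4 |
| 7 | 2015-03-28 10:24:33 | 73 | 2 | 50 | 400 | 2 | 500 | 88 | 45 | 90 | ... | 4 | 1 | 3 | 2 | 4 | 1 | 4 | 4 | 5 | 3 |
| 8 | 2015-03-28 10:28:03 | 17 | 17 | 56 | 345 | 17 | 564 | 17 | 17 | 17 | ... | 5 | 1 | 5 | 1 | 4 | 1 | 5 | 4 | 5 | 5 |
| 9 | 2015-03-28 10:31:14 | 79 | 979 | 79 | 120 | 79 | 124 | 40 | 8 | 22 | ... | 3 | 1 | 1 | 1 | 1 | 3 | 5 | 1 | 3 | 3 |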
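
Once the data is loaded, the overview questions from step 3 (size, variable types, value ranges) could be approached along these lines. This is a minimal sketch on a small invented frame: the column names mimic the ones shown above, but the values and the `group` column are made up for illustration. With the real data, the same calls would be applied to `random_numbers`.

```python
import numpy as np
import pandas as pd

# Invented toy frame standing in for the real data (values are not from the dataset)
toy = pd.DataFrame(
    {
        "R1": [56, 5, 8, 99],
        "E1": [3, 1, 0, 4],
        "O1": [4, 5, 4, np.nan],
        "group": ["a", "b", "a", "b"],  # hypothetical categorical column
    }
)

# Size of the dataset: number of rows and number of columns
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")

# Data type of each column (numeric vs. object/string)
print(toy.dtypes)

# Number of distinct non-missing values per column;
# small counts often point to categorical or ordinal variables
print(toy.nunique())

# Observed range of each numeric column, useful for spotting
# out-of-scale values (e.g., a 0 on an item expected to run from 1 to 5)
numeric = toy.select_dtypes(include="number")
value_ranges = numeric.agg(["min", "max"])
print(value_ranges)
```

Comparing `value_ranges` against the documented response scale is a quick way to catch coding conventions, such as a sentinel value for skipped items, before any modeling.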


In [19]:
# check the scale of the questions (minimum and maximum observed response)
min_O1 = random_numbers["O1"].min()
max_O1 = random_numbers["O1"].max()

min_E1 = random_numbers["E1"].min()
max_E1 = random_numbers["E1"].max()

# scale of E1 (the O1 range is computed above but not printed)
print(min_E1, max_E1)

0 5


The observed scale appears to run from 0 to 5, but Big 5 personality items are scored from 1 to 5. We therefore check how many 0s there are, on the assumption that they encode missing responses (NAs).

In [18]:
random_numbers["E1"].value_counts()

3    414
1    315
2    284
4    254
5    101
0      1
Name: E1, dtype: int64

In [31]:
# looking at the distribution of responses for C9
random_numbers.C9.value_counts()

4    413
3    326
2    247
5    214
1    162
0      7
Name: C9, dtype: int64
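
If the 0s do encode skipped items (an assumption, not something confirmed by the output above), one option is to recode them as `NaN` and then compute the proportion of missing values per column and per row. The sketch below uses an invented toy frame. With the real data, the recoding would presumably be restricted to the questionnaire item columns, since 0 may be a legitimate value in other columns.

```python
import numpy as np
import pandas as pd

# Invented toy responses (not taken from the dataset); 0 is assumed to mean "no answer"
items = pd.DataFrame(
    {
        "E1": [3, 1, 0, 4],
        "C9": [4, 0, 0, 5],
        "O1": [4, 5, 4, 2],
    }
)

# Recode the assumed missing-value marker as NaN
items_clean = items.replace(0, np.nan)

# Proportion of missing values per column:
# the mean of a boolean mask is the share of True values
missing_per_column = items_clean.isnull().mean()

# Proportion of missing values per row (average across columns)
missing_per_row = items_clean.isnull().mean(axis=1)

print(missing_per_column)
print(missing_per_row)
```

Using `.mean()` on the boolean mask gives proportions directly; `.sum()` would give counts, which can then be divided by the number of rows or columns.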

### Once you have done this
1. Share your answers to 4 and 5 on the Brightspace Padlet
2. Keep your notebooks: I will ask you to briefly walk us through them!
3. Next week, we will select a subset of your questions, and work on them for the first few weeks