# Class 6 homework - More on different plots for different data

This homework continues the ideas that:

1. We can generate statistically random data from known distributions.
2. Different kinds of plots work best for different kinds of data.

We will generate data from different distributions, with different structure and correlations, and make informative plots of them.



## Load modules

The random number generators that we will use are in the numpy (numerical Python) module, so we need to load that along with pandas and seaborn.

Enter the import statements that you need below. These should be familiar by now from class 6.
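As a reminder, a minimal sketch of the imports is given here. The short aliases `np`, `pd` and `sns` are the usual conventions, and the code cells later in this homework assume them.

```python
import numpy as np      # random number generators and arrays
import pandas as pd     # data frames
import seaborn as sns   # statistical plots
```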

## Different distributions of pseudorandom numbers

The numpy pseudorandom number generator supplies different methods to generate numbers from different distributions, including:

- uniformly distributed between 0 and 1, `random`, that we used in class 6.
- normal distribution, `normal`, that is ...
- exponential distribution, `exponential`, that is ...
- Poisson distribution, `poisson`, that is ...
- and many more

These are used in a similar way to `.random`:

In [None]:
# set up a random number generator
rng_seed1 = np.random.default_rng(seed=1)
# use it to generate a random number
rng_seed1.random()

In [None]:
rng_seed1.normal()

In [None]:
rng_seed1.exponential()

In [None]:
rng_seed1.poisson()
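Each of these methods also takes parameters that set the shape of the distribution, as well as a `size` argument for drawing many numbers at once. The sketch below is only an illustration: the parameter values and the variable names (`rng`, `z`, `shifted`, `waits`, `counts`) are arbitrary choices, not part of the homework.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Defaults: normal has mean (loc) 0 and standard deviation (scale) 1
z = rng.normal(size=5)

# Shifted and scaled normal: mean 10, standard deviation 2
shifted = rng.normal(loc=10, scale=2, size=5)

# Exponential with mean (scale) 3; the default scale is 1
waits = rng.exponential(scale=3, size=5)

# Poisson counts with mean (lam) 4; the default lam is 1
counts = rng.poisson(lam=4, size=5)

print(z, shifted, waits, counts, sep="\n")
```

Note that `poisson` returns whole-number counts, while the other methods return floating-point numbers.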

Generating random numbers, then plotting them, helps to understand the differences between these distributions.

We will quickly generate some random numbers from these distributions, in data frames of either 10 or 1000 points, and plot them.

In [None]:
# set up a random number generator
rng_seed2025 = np.random.default_rng(seed=2025)

# generate a data frame with 10 random numbers of each
df_random10 = pd.DataFrame(
    {"unif" : rng_seed2025.random(size = 10), 
     "norm" : rng_seed2025.normal(size = 10),
     "expo" : rng_seed2025.exponential(size = 10),
     "pois" : rng_seed2025.poisson(size = 10)
    })

# generate a data frame with 1000 random numbers of each
df_random1000 = pd.DataFrame(
    {"unif" : rng_seed2025.random(size = 1000), 
     "norm" : rng_seed2025.normal(size = 1000),
     "expo" : rng_seed2025.exponential(size = 1000),
     "pois" : rng_seed2025.poisson(size = 1000)
    })

First, work on describing these data frames.

What are the means? What do you expect to see in plots?

*Reminder: The statistics from small samples may not be the same as from the larger samples... averages are only on average.*
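One possible way to compare the means is sketched below. The helper `make_random_df` and the names `df_small`, `df_large` and `means` are made up for this illustration. With the default parameters, the theoretical means are 0.5 (uniform), 0 (normal), 1 (exponential) and 1 (Poisson).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2025)


def make_random_df(rng, n):
    """Return a data frame with n draws from each of the four distributions."""
    return pd.DataFrame({
        "unif": rng.random(size=n),
        "norm": rng.normal(size=n),
        "expo": rng.exponential(size=n),
        "pois": rng.poisson(size=n),
    })


df_small = make_random_df(rng, 10)
df_large = make_random_df(rng, 1000)

# Sample means side by side, one row per distribution
means = pd.DataFrame({"n=10": df_small.mean(), "n=1000": df_large.mean()})
print(means)

# Full summary (count, mean, std, quartiles) for the large sample
print(df_large.describe())
```

The means of the large sample should sit close to the theoretical values, while the means of the small sample can be noticeably off.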

Try now plotting some of these using the 1-d distribution plots from Class 6:

- histplot
- kdeplot
- ecdfplot

## The pairplot as a way to make rapid comparisons

The pairplot of a data frame allows quick inspection of distributions and correlations. By plotting these next to each other, pairplots can make differences and similarities clearer.

By default, the pairplot has the 1-d distribution (e.g. histogram) on the diagonal, and scatterplots off the diagonal.

Try making pairplots of the small dataset and the large dataset.

The structure of the distributions is hard to see from a sample of 10 points, but much easier with a sample of 1000.

Now try adjusting the plots in a way that's most informative.
First, read the [seaborn help page for pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html). 
Then, start experimenting.

You could try `kind = "kde"`, to start.

See what happens if you adjust the `corner` argument, and decide whether that makes the plot more or less clear.

Which plots are most informative? How does that depend on the size and shape of the data?

## Generating correlated data

Next, we will generate some data with known correlations and see how that leads to patterns in plotting.

We will do this by first generating some independent random numbers, then creating new columns by combining them.

In [None]:
# set up a new random number generator with a different seed
rng_seed329 = np.random.default_rng(seed=329)

# generate a data frame with 100 random numbers in each column
df_correlated100 = pd.DataFrame(
    {"a" : rng_seed329.random(size = 100),
     "b" : rng_seed329.normal(size = 100),
     "c" : rng_seed329.normal(size = 100)
    })

df_correlated100["a_plus_b"] = df_correlated100["a"] + df_correlated100["b"] 
df_correlated100["a5x_plus_c"] = 5 * df_correlated100["a"] + df_correlated100["c"] 
df_correlated100["b_minus_c"] = df_correlated100["b"] - df_correlated100["c"] 



Before you go any further, try to understand this code. What did we do? What tools from the course did we use? How many columns will the resulting data frame have?

In [None]:
# Space to trial some of these

Make sure you check the data frame too, using `.describe()`.

Now, make a pairplot. Be sure to relate the curves that you see on the plot with the numbers in `.describe()`.

How can you interpret this?

Try redoing the pairplot with `corner = True` and drawing regression lines with `kind = "reg"`.

To interpret this plot, consider one panel at a time.

The panels on the diagonal are 1-d distributions, here histograms, of each variable individually.

- a is uniformly distributed
- b and c are normally distributed
- distributions of other variables reflect how they were created as sums of columns

Off the diagonal, the panels are scatter plots of one variable against another, each with a linear regression line.

- a vs b is not correlated, because they're independent
- a_plus_b is weakly positively correlated with a, because it was created by adding a to b
- a_plus_b is strongly correlated with b, because b varies much more than a and so dominates the sum
- **Keep going for all the panels**
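The correlations described in the list above can also be checked numerically. The sketch below rebuilds the same data frame under the shorter name `df` (a name chosen only for this example) and prints the correlation matrix from the pandas `.corr()` method, which holds one number for each panel of the pairplot.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=329)
df = pd.DataFrame({
    "a": rng.random(size=100),
    "b": rng.normal(size=100),
    "c": rng.normal(size=100),
})
df["a_plus_b"] = df["a"] + df["b"]
df["a5x_plus_c"] = 5 * df["a"] + df["c"]
df["b_minus_c"] = df["b"] - df["c"]

# Pearson correlation between every pair of columns
corr = df.corr()
print(corr.round(2))
```

Values near 0 correspond to panels with a flat regression line, and values near 1 or -1 to panels with points lying close to a sloped line.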



It can be hard to focus on a single panel in a big grid like this. It makes sense to pick the most interesting panels from the pairplot and plot them again on their own.

Try that with an `lmplot` of just `a5x_plus_c` against `a`.

*Tip: this plot shows the kind of data that linear regression is most suitable for - here a5x_plus_c is a variable that strongly depends on a, but has added variability with a normal distribution.*
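A minimal sketch of this plot follows, rebuilding only the two columns it needs under the name `df` (chosen for this example). The `np.polyfit` line is an extra that is not part of the homework: it fits the same kind of straight line so that the slope can be compared with the factor of 5 used to create the column.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(seed=329)
df = pd.DataFrame({
    "a": rng.random(size=100),
    "b": rng.normal(size=100),  # drawn to keep the same sequence as the homework data
    "c": rng.normal(size=100),
})
df["a5x_plus_c"] = 5 * df["a"] + df["c"]

# Scatter plot with a fitted regression line and its confidence band
g = sns.lmplot(data=df, x="a", y="a5x_plus_c")

# Straight-line fit: np.polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(df["a"], df["a5x_plus_c"], deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The fitted slope should be in the region of 5, but not exactly 5, because of the added normal variability.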



Plot some other single panels, and try different visualisations, to understand the relationships better.

Maybe try a joint plot?

Try a 2d distribution plot as a kernel density estimate (`kde`). This shows how 2-d kernel density estimates highlight correlations between variables.


Lastly, try the pairplot as kernel density estimate (`kind = "kde"`).

Again, systematically think through the panels to understand what the plot might be showing you about the data distribution.

## Conclusion

When dealing with real data, as in the group project, this kind of systematic data exploration is essential. The plots are tools to understand the data, and different kinds of plots are better for different kinds of data.

In this homework, we used randomly generated numbers with known properties to show how different plotting strategies can help us see those properties.

### Note: Why we used the numpy random number generator

Python usually has multiple ways to do the same thing, which can be confusing. There are different ways to generate random numbers in Python, including the `random` module. 

The numpy random number generators used here have a consistent way of generating a whole array of random numbers at once, which is helpful. Be warned that the numpy random number generator documentation is quite technical and mathematical. Alternative distributions are listed at [numpy Random Generator distributions list](https://numpy.org/doc/2.1/reference/random/generator.html#distributions).

Generating random numbers will not be on the class test. (It will be in the next class quiz.)

Plotting data and interpreting plots will be on the class test.
