# Diamonds

The diamonds dataset is well known. It is one of the example datasets available in R (through the ggplot2 package), but you can also [download](https://www.kaggle.com/datasets/shivam2503/diamonds) it as a CSV. We've already done that for you, and prepared some exercises that use it (and draw pretty graphs, of course).

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df=pd.read_csv("https://raw.githubusercontent.com/mjochen/Beobank_course_material/main/Data_science/5%20Pandas%20introduction/Exercises/files/diamonds.csv", index_col=0)
df.head()


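|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 3 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 4 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 5 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |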


Also describe the dataset, so you have an idea of how big it is and what value ranges we are talking about.

In [None]:
df.describe()

Create a bar-chart showing how many diamonds there are of every cut. You can use matplotlib and grouping, but easier would be to use seaborn's countplot.

The x-axis of the previous graph looks like this:

![](https://raw.githubusercontent.com/mjochen/Beobank_course_material/main/Data_science/5%20Pandas%20introduction/Exercises/files/2022-08-30-13-39-46.png)

That isn't good, because diamond cuts have a natural order, from worst to best:

<code>['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']</code>
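One possible way to get the bars in that order, sketched on a tiny made-up frame (`toy` and its values are stand-ins for the real `df`, not course data): either pass the order to `countplot`, or store it in the column as an ordered categorical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for the real df.
toy = pd.DataFrame({"cut": ["Ideal", "Good", "Ideal", "Fair", "Premium",
                            "Very Good", "Ideal", "Premium", "Good"]})

cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# Option 1: tell countplot which order to use on the x-axis.
fig, ax = plt.subplots()
sns.countplot(data=toy, x="cut", order=cut_order, ax=ax)
ax.set_ylabel("number of diamonds")

# Option 2: store the order in the column itself, so later
# groupings and tables follow it as well.
toy["cut"] = pd.Categorical(toy["cut"], categories=cut_order, ordered=True)
counts = toy["cut"].value_counts().sort_index()
print(counts)
```

The second option is probably the more durable one, since the order then travels with the data instead of being repeated in every plot call.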

Show the distribution of diamonds by carat (their weight, often loosely called size). Try to do it in three words. (A _histogram_ on _carat_ in the dataset _df_, but in reverse.)

A good plot, but with histograms the choice of bins matters. Draw three histograms with different numbers of bins (5, 10 and 40).
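A minimal sketch of how the three histograms could be drawn side by side. The `toy` frame is randomly generated and only mimics a skewed carat column; the numbers 5, 10 and 40 are read here as bin counts, which is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical, randomly generated stand-in for df["carat"].
rng = np.random.default_rng(0)
toy = pd.DataFrame({"carat": rng.gamma(shape=2.0, scale=0.4, size=1000).round(2)})

bin_counts = [5, 10, 40]
fig, axes = plt.subplots(1, len(bin_counts), figsize=(12, 3))
for ax, bins in zip(axes, bin_counts):
    toy["carat"].hist(bins=bins, ax=ax)  # same call as a single histogram, plus bins
    ax.set_title(f"{bins} bins")
    ax.set_xlabel("carat")
fig.tight_layout()
```

With few bins the shape is smoothed away; with many bins local spikes start to show.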

In the last graph you can see that more is going on in this dataset. There are spikes. Also, why does the x-axis go up to 5 when the last visible data is around 3?

Show the diamonds with a carat size above 3.5.

These are what we call outliers. They're not that interesting, especially since we're talking about roughly 10 observations out of some 50,000.

So draw:
* A histogram
* With 200 bins
* Of carat values below 3.5
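The filter and the finer histogram can be sketched with a boolean mask. The `toy` values below are invented for illustration; only the threshold of 3.5 and the 200 bins come from the text.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for the real df.
toy = pd.DataFrame({"carat": [0.3, 0.31, 0.7, 1.0, 1.01, 1.5, 2.0, 3.65, 4.01, 5.01]})

outliers = toy[toy["carat"] > 3.5]   # the few very heavy stones
trimmed = toy[toy["carat"] < 3.5]    # everything else

fig, ax = plt.subplots()
trimmed["carat"].hist(bins=200, ax=ax)
ax.set_xlabel("carat")
print(outliers)
```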

A few non-technical questions for once:

* Which values are most common? Why?
* Which values are rare? Why? Is this expected?
* Why are there more values to the right of the peak?
* Why are there almost no diamonds bigger than 3 carats?
* There seem to be (sub)groups, or clusters, of similar values.
    * How are observations within a cluster similar?
    * How do observations from separate clusters differ?

Plot a histogram with 100 bins of the Y value (which is the width of the diamond in mm).

Many outliers here! Show price, x, y and z for every diamond with Y bigger than 20 or smaller than 3.
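Selecting rows and columns in one step could look like the sketch below. The `toy` frame and all its values are made up; the conditions on Y and the four column names are the ones from the text.

```python
import pandas as pd

# Hypothetical toy data standing in for the real df.
toy = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Fair", "Ideal"],
    "price": [326, 5100, 12000, 2100, 554],
    "x": [3.95, 0.0, 8.1, 5.2, 4.3],
    "y": [3.98, 0.0, 58.0, 31.0, 4.31],
    "z": [2.43, 0.0, 8.0, 5.1, 2.6],
})

# Combine two conditions with | (element-wise "or"); the parentheses are required.
odd_y = (toy["y"] > 20) | (toy["y"] < 3)

# .loc takes the row mask first, then the list of columns to keep.
result = toy.loc[odd_y, ["price", "x", "y", "z"]]
print(result)
```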

You should note that:

* A couple of diamonds have a size of 0 (in this dataframe) and should not be considered in this graph.
* There's a diamond 5.8 cm wide that costs only $12k.

Redraw the graph without these values.

So far we have ignored these observations entirely, which was fine because we were only looking at Y. But if we keep working with the dataset we can't simply drop them, because we would lose all the other data in those rows.

So set every X, Y and Z value that is 0, or that is 20 or more, to NA. Use [replace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html?highlight=replace#pandas.DataFrame.replace) or [loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html).
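A minimal sketch of the `loc` route, on an invented `toy` frame: only the suspicious cells become NA, and the rest of each row (for example the price) stays available.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real df.
toy = pd.DataFrame({
    "price": [326, 5100, 12000, 2100],
    "x": [3.95, 0.0, 8.1, 5.2],
    "y": [3.98, 0.0, 58.0, 31.0],
    "z": [2.43, 0.0, 8.0, 5.1],
})

for col in ["x", "y", "z"]:
    bad = (toy[col] == 0) | (toy[col] >= 20)  # the rule from the text
    toy.loc[bad, col] = np.nan                # overwrite only the offending cells

print(toy)
```

Because the mask is computed per column, a row with a bad Y but a plausible X keeps its X value.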

Now show the same table as before. That is harder than it sounds, because the old filter no longer finds these rows (a comparison with NA is never true). But we saved you a list of line numbers:

<code>[11964,15952,24068,24521,26244,27430,49190,49557,49558]</code>

Use [iloc](https://www.statology.org/pandas-select-rows-by-index/).

Getting the wrong lines? The positions that iloc expects and the line numbers differ by 1: iloc positions are 0-based, while the line numbers are 1-based.
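The off-by-one can be made visible on a small example. The `toy` frame and the two line numbers are hypothetical; the index labels start at 1 to mimic a frame read with `index_col=0`.

```python
import pandas as pd

# Hypothetical toy frame whose index labels start at 1, like the table shown earlier.
toy = pd.DataFrame({"y": [3.98, 0.0, 4.07, 58.0, 4.35]}, index=range(1, 6))

line_numbers = [2, 4]  # 1-based line numbers (made up for this sketch)

wrong = toy.iloc[line_numbers]                   # positions 2 and 4: one row too far
right = toy.iloc[[n - 1 for n in line_numbers]]  # shift to 0-based positions

# .loc selects by label instead of position; it matches here only
# because the labels happen to equal the line numbers.
by_label = toy.loc[line_numbers]

print(right)
```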

Now draw a new histogram of Y, this time without the filtering.

Based on this histogram, does a boxplot look like a good idea?

No. A boxplot hides all the nuances in the sizes. Now draw multiple boxplots (using seaborn) showing the prices per cut.
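A sketch of one boxplot per cut with seaborn, again on invented numbers. The `order` argument is optional and is only used here to keep the cuts in their natural order.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for the real df.
toy = pd.DataFrame({
    "cut": ["Fair", "Fair", "Good", "Good", "Very Good", "Very Good",
            "Premium", "Premium", "Ideal", "Ideal"],
    "price": [3000, 5000, 2500, 4500, 2400, 4800, 3200, 6000, 1800, 3500],
})
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

fig, ax = plt.subplots()
sns.boxplot(data=toy, x="cut", y="price", order=cut_order, ax=ax)
ax.set_ylabel("price (USD)")

# The line in the middle of each box is the median; the same numbers as a table:
medians = toy.groupby("cut")["price"].median().reindex(cut_order)
print(medians)
```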

Looks wrong, doesn't it? You would expect the premium and ideal diamonds to be more expensive. Do the same, but show the weight instead of the price.

Still the same picture. Maybe the clarity comes into play? Compare the cut with the clarity.

Both are categorical variables, by the way, so a [cross table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) would be a good idea.
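As an illustration of what `pd.crosstab` returns, here is a sketch on a handful of made-up rows. The `normalize="index"` variant turns counts into row shares, which is often easier to compare across cuts of very different sizes.

```python
import pandas as pd

# Hypothetical toy data standing in for the real df.
toy = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Ideal", "Premium", "Premium", "Fair", "Fair", "Good"],
    "clarity": ["VS1", "VS1", "SI2", "SI1", "VS1", "SI2", "SI2", "SI1"],
})

# Counts: one row per cut, one column per clarity.
table = pd.crosstab(toy["cut"], toy["clarity"])

# Shares within each cut (every row sums to 1).
share = pd.crosstab(toy["cut"], toy["clarity"], normalize="index")

print(table)
print(share.round(2))
```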

What would happen if we plot the price against the carat? Both are continuous variables, so a scatterplot is fine.

You could simply write <code>plt.scatter(df.carat, df.price)</code>, but there would be much room for improvement.

* There are about 50,000 dots on there. Make sure they are small enough.
* Add titles to the axes.
* Add a first-order (linear) trendline.
* Add a fourth-order trendline.
* Make sure Y is limited to between 0 and 20,000 (there are no values outside that range, but the trendlines tend to run past those limits).
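The checklist above could be worked through roughly as follows. The data is randomly generated with an invented price formula, and using `np.polyfit` for the trendlines is one choice among several, not necessarily the intended one.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical, randomly generated stand-in for df (the formula is made up).
rng = np.random.default_rng(1)
carat = rng.uniform(0.2, 3.0, size=500)
price = 4000 * carat**1.7 + rng.normal(0, 500, size=500)
toy = pd.DataFrame({"carat": carat, "price": price.clip(min=300)})

fig, ax = plt.subplots()
ax.scatter(toy["carat"], toy["price"], s=1, alpha=0.5)  # s=1 keeps the dots small

# Fit polynomials of order 1 and 4 and draw them over the dots.
xs = np.linspace(toy["carat"].min(), toy["carat"].max(), 200)
fits = {}
for degree in (1, 4):
    fits[degree] = np.polyfit(toy["carat"], toy["price"], deg=degree)
    ax.plot(xs, np.polyval(fits[degree], xs), label=f"order {degree}")

ax.set_xlabel("carat")
ax.set_ylabel("price (USD)")
ax.set_ylim(0, 20_000)  # keep runaway trendlines from stretching the axis
ax.legend()
```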

Note the covariance: if the price goes up, so does the weight. Maybe it's the other way around, but you would need a domain specialist (or some common sense) to decide that. And remember that covariance or correlation doesn't always imply causality.

![](https://raw.githubusercontent.com/mjochen/Beobank_course_material/main/Data_science/5%20Pandas%20introduction/Exercises/files/2022-08-31-15-35-41.png)

[spurious-correlations](http://www.tylervigen.com/spurious-correlations)