# Minerals
This Jupyter notebook uses mineral data that was [scraped with a Python script](https://github.com/florianneukirchen/scrape_mineral_data) from [Wikipedia](https://github.com/florianneukirchen/jupyter-notebooks/blob/main/minerals-cleaned.csv). I lightly cleaned my original [minerals.csv](https://raw.githubusercontent.com/florianneukirchen/scrape_mineral_data/main/minerals.csv) and dropped some columns. Beware of missing data and of values in unexpected formats.

Data: [minerals-cleaned.csv](https://github.com/florianneukirchen/jupyter-notebooks/blob/main/minerals-cleaned.csv), © Wikipedia editors and contributors, [Creative Commons Attribution-ShareAlike 3.0 Unported License](https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License).

Notebook by [Florian Neukirchen](https://www.riannek.de/), requires [Plotly](https://plotly.com/python/getting-started/) and [pandas](https://pandas.pydata.org/).

In [72]:
import re
import plotly.express as px
import pandas as pd

minerals = pd.read_csv("minerals-cleaned.csv")
minerals.head()

Unnamed: 0,name,url,chemistry,chemistry html,IMA Symbol,Strunz class,crystal system,crystal class,color,cleavage,mohs,streak,gravity,luster,habit,varieties
0,Abelsonite,https://en.wikipedia.org/wiki/Abelsonite,C31H32N4Ni,C<sub>31</sub>H<sub>32</sub>N<sub>4</sub>Ni,Abl,10.CA.20,Triclinic,,"Pink-purple, dark greyish purple, pale purplis...",Probable on {111},2–3,Pink,1.45,"Adamantine, sub-metallic",,
1,Abenakiite-(Ce),https://en.wikipedia.org/wiki/Abenakiite-(Ce),Na26Ce6(SiO3)6(PO4)6(CO3)6(S4+O2)O,Na<sub>26</sub>Ce<sub>6</sub>(SiO<sub>3</sub>)...,Abk-Ce,9.CK.10,Trigonal,Rhombohedral,Pale brown,"{0001}, poor",4-5,White,"3.21 (meas.), 3.27 (calc.)",Vitreous,,
2,Abernathyite,https://en.wikipedia.org/wiki/Abernathyite,K(UO2)(AsO4)·3H2O,K(UO<sub>2</sub>)(AsO<sub>4</sub>)·3H<sub>2</s...,Abn,8.EB.15,Tetragonal,Ditetragonal dipyramidal,Yellow,Perfect on {001},2.5–3,Pale yellow,3.32 (measured) 3.572 (calculated),"Sub-Vitreous, resinous, waxy, greasy",,
3,Abhurite,https://en.wikipedia.org/wiki/Abhurite,Sn21O6(OH)14Cl16,Sn<sub>21</sub>O<sub>6</sub>(OH)<sub>14</sub>C...,Abh,3.DA.30,Trigonal,Trapezohedral,Colorless,,2,White,4.42,,"Platy, thin crystals, cryptocrystalline crusts",
4,Abramovite,https://en.wikipedia.org/wiki/Abramovite,Pb2SnInBiS7,Pb<sub>2</sub>SnInBiS<sub>7</sub>,Abm,2.HF.25a,Triclinic,Pinacoidal,Silver gray,Perfect on {100},,Black,,Metallic,Encrustations - Forms crust-like aggregates on...,


## Strunz class
The elements of a Strunz class are hierarchical from left to right. For example: The Strunz class of muscovite is 9.EC.15; that means:
- Class 9: Silicates and Germanates
- Mineral Division E: Phyllosilicates
- Mineral Family C: Phyllosilicates with mica sheets, composed of tetrahedral and octahedral nets
- Mineral/Group number 15

Therefore we should turn Strunz classes into 4 columns: e.g. '9.EC.15' to '9', 'E', 'C', '15'.

In [73]:
strunz = minerals['Strunz class'].str.split(".", expand=True)
minerals['strunz0'] = strunz[0]
minerals['strunz1'] = strunz[1]

minerals['strunz2'] = minerals['strunz1'].str[1:]
minerals['strunz1'] = minerals['strunz1'].str[0]

minerals['strunz3'] = strunz[2]
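
As a quick sanity check, the same split can be sketched for a single code in plain Python. The helper name `split_strunz` is hypothetical and not part of this notebook; it assumes a code with exactly three dot-separated parts, as in the muscovite example.

```python
def split_strunz(code):
    """Split a Strunz code such as '9.EC.15' into its four hierarchy levels.

    Hypothetical helper for illustration only; assumes three dot-separated
    parts, the middle one starting with the division letter.
    """
    strunz_class, letters, number = code.split(".")
    division, family = letters[0], letters[1:]
    return strunz_class, division, family, number


print(split_strunz("9.EC.15"))   # ('9', 'E', 'C', '15')
print(split_strunz("2.HF.25a"))  # ('2', 'H', 'F', '25a')
```

Note that the last part is kept as a string, because values such as '25a' are not numbers.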

Get a subset of the dataframe without missing values

In [74]:
df = minerals[['strunz0', 'strunz1', 'strunz2', 'strunz3']].dropna()

Cast strunz0 to int so that it sorts numerically

In [75]:
df['strunz0'] = df['strunz0'].astype(int)
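
Why the cast matters: after the split, the class is a string, and strings sort character by character. A minimal illustration with made-up values:

```python
codes = ["1", "10", "2", "9"]

# As text, "10" sorts before "2" because "1" < "2".
as_text = sorted(codes)
# As integers, the order is the expected numeric one.
as_numbers = sorted(int(c) for c in codes)

print(as_text)     # ['1', '10', '2', '9']
print(as_numbers)  # [1, 2, 9, 10]
```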

Define a dictionary with the names of the classes, so we can map class '1' to 'elements' etc. Also add a column to the dataframe.

In [76]:
strunzclasses = {
    1: 'Elements',
    2: 'Sulfides, Sulfosalts',
    3: 'Halides',
    4: 'Oxides, Hydroxides, Arsenites',
    5: 'Carbonates, Nitrates',
    6: 'Borates',
    7: 'Sulfates, Chromates, Molybdates, Tungstates',
    8: 'Phosphates, Arsenates, Vanadates',
    9: 'Silicates',
    10: 'Organic Compounds',
}

df['strunz name'] = df['strunz0'].map(strunzclasses)

In [77]:
df.sort_values(by=['strunz0', 'strunz1', 'strunz2', 'strunz3'], inplace=True)

df.head()

Unnamed: 0,strunz0,strunz1,strunz2,strunz3,strunz name
353,1,A,A,05,Elements
141,1,A,A,10a,Elements
707,1,A,A,15,Elements
374,1,A,A,20,Elements
1402,1,A,B,10a,Elements


## Sunburst plots

Plotly's sunburst plots are a great way to visualize hierarchical data. The following plot has the class in the center, surrounded by the division and mineral family letters; the size of each slice corresponds to the number of minerals. You can double-click on any class or division to filter the data (double-click again to reset).

In [78]:
fig = px.sunburst(df, path=['strunz0', 'strunz1', 'strunz2'], height=600, title='Strunz class', hover_name=df['strunz name'])
fig.show()
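
Since no `values` column is passed, the slice sizes should correspond to the number of rows that share each path prefix. The following sketch reproduces that counting in plain Python; the rows are made up and the helper `path_counts` is hypothetical.

```python
from collections import Counter

# Made-up (class, division, family) rows, standing in for strunz0/strunz1/strunz2.
rows = [
    (9, "E", "C"),
    (9, "E", "C"),
    (9, "A", "D"),
    (2, "H", "F"),
]


def path_counts(rows, depth):
    """Count rows per path prefix of the given depth (1 = class only)."""
    return Counter(row[:depth] for row in rows)


print(path_counts(rows, 1))  # counts for the inner ring (class)
print(path_counts(rows, 2))  # counts for the middle ring (class + division)
print(path_counts(rows, 3))  # counts for the outer ring (full path)
```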

The same data as a treemap:

In [79]:
fig = px.treemap(df, path=['strunz0', 'strunz1', 'strunz2'], title='Strunz class', hover_name=df['strunz name'])
fig.show()

Plot a histogram of the Strunz classes

In [80]:
df.head()

Unnamed: 0,strunz0,strunz1,strunz2,strunz3,strunz name
353,1,A,A,05,Elements
141,1,A,A,10a,Elements
707,1,A,A,15,Elements
374,1,A,A,20,Elements
1402,1,A,B,10a,Elements


In [81]:
labels = {'strunz name': 'Strunz class'}
fig = px.histogram(df, x='strunz name',  labels=labels)
fig.update_layout(xaxis_type='category')
fig.show()

In [82]:
labels = {'strunz0': 'Strunz class'}
category_orders = {'strunz0': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']}
fig = px.histogram(df, x='strunz0', category_orders=category_orders, labels=labels)

fig.update_layout(xaxis_type='category')
fig.show()

## Crystal system and crystal class

We can do the same for crystal system and crystal class. Note that the data is a bit messy: some minerals have values such as "Triclinic or hexagonal", "Monoclinic/possibly prismatic" or "Unkown space group". However, it is still good enough to get a broad picture.

In [83]:
# Build a new dataframe from the full data.
# Note: a different set of rows is dropped here, because the NaN values sit in other columns.
df2 = minerals[['crystal system', 'crystal class']].dropna()

In [84]:
fig = px.sunburst(df2, path=['crystal system', 'crystal class'], height=600, title='Crystal system and crystal class')
fig.show()
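
The notebook plots the values as they are. If a tidier picture is wanted, one possible approach is to keep only the first crystal system named in each value. The sketch below is an assumption, not part of the original analysis: the helper `normalize_system` and both lookup tables are hypothetical, and taking the first match discards the ambiguity of values such as "Triclinic or hexagonal". It also merges "Isometric" into "Cubic", since both names refer to the same crystal system and both appear in this dataset.

```python
import re

# Assumed canonical names; "Isometric" is treated as a synonym for "Cubic".
SYSTEMS = ["Triclinic", "Monoclinic", "Orthorhombic", "Tetragonal",
           "Trigonal", "Hexagonal", "Cubic"]
SYNONYMS = {"isometric": "Cubic"}


def normalize_system(text):
    """Return the first crystal system named in text, or None if none is found."""
    for word in re.findall(r"[A-Za-z]+", text):
        key = word.lower()
        if key in SYNONYMS:
            return SYNONYMS[key]
        for system in SYSTEMS:
            if key == system.lower():
                return system
    return None


for raw in ["Triclinic or hexagonal", "Monoclinic/possibly prismatic", "Isometric"]:
    print(repr(raw), "->", normalize_system(raw))
```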

## How do crystal system and Strunz class relate?
Strunz classes are based on chemistry, and the chemical composition of a mineral is closely tied to its crystal lattice, so we can expect the two to be related.

The next plot shows Strunz class in the center, surrounded by crystal system:

In [85]:
df3 = minerals[['strunz0', 'crystal system']].dropna()
df3['strunz0'] = df3['strunz0'].astype(int)
df3['strunz name'] = df3['strunz0'].map(strunzclasses)
df3.sort_values(by=['strunz0', 'crystal system'], inplace=True)
df3.head()

Unnamed: 0,strunz0,crystal system,strunz name
141,1,Cubic,Elements
147,1,Cubic,Elements
268,1,Cubic,Elements
353,1,Cubic,Elements
400,1,Cubic,Elements


In [86]:
fig = px.sunburst(df3, path=['strunz0', 'crystal system'], height=600, title='Strunz class and crystal system', hover_name=df3['strunz name'])
fig.show()

Crystal system in the center, surrounded by Strunz class:

In [87]:
fig = px.sunburst(minerals[['strunz0', 'strunz1', 'crystal system']].dropna(), path=['crystal system', 'strunz0', 'strunz1'], height=600)
fig.show()

## Sankey Diagrams
These are a bit more complicated to generate. We have to create some dictionaries and lists; I try to derive as much as possible from the dataframe.

In [88]:
import plotly.graph_objects as go

In [89]:
df4 = df3.value_counts().reset_index(name='count')
df4.sort_values(['strunz0', 'crystal system'], inplace=True)
df4

Unnamed: 0,strunz0,crystal system,strunz name,count
39,1,Cubic,Elements,9
52,1,Hexagonal,Elements,5
46,1,Isometric,Elements,7
72,1,Monoclinic,Elements,1
48,1,Orthorhombic,Elements,6
...,...,...,...,...
14,9,Trigonal,Silicates,25
50,10,Monoclinic,Organic Compounds,5
54,10,Orthorhombic,Organic Compounds,4
64,10,Tetragonal,Organic Compounds,2


Get the data in the form of lists

In [90]:
# List of all labels
label = list(df4['strunz name'].unique()) + list(df4['crystal system'].unique())

# Links: for source and target we need to provide the indices of the label list
source = [label.index(name) for name in df4['strunz name']]
target = [label.index(name) for name in df4['crystal system']]
value = df4['count'].tolist()
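
To make the index bookkeeping easier to follow, here is the same idea on a tiny made-up table in plain Python. It uses a dictionary lookup instead of `list.index()`, which gives the same result as long as no name appears both as a source and as a target (an assumption that holds for class names and crystal systems).

```python
# Made-up (source name, target name, count) rows, standing in for df4.
rows = [
    ("Elements", "Cubic", 9),
    ("Elements", "Hexagonal", 5),
    ("Silicates", "Trigonal", 25),
    ("Silicates", "Cubic", 3),
]

# Node labels: all sources first, then all targets, each in order of first appearance.
sources = list(dict.fromkeys(r[0] for r in rows))
targets = list(dict.fromkeys(r[1] for r in rows))
label = sources + targets

# Map each label to its position so every link can refer to nodes by index.
index = {name: i for i, name in enumerate(label)}
source = [index[s] for s, _, _ in rows]
target = [index[t] for _, t, _ in rows]
value = [n for _, _, n in rows]

print(label)   # ['Elements', 'Silicates', 'Cubic', 'Hexagonal', 'Trigonal']
print(source)  # [0, 0, 1, 1]
print(target)  # [2, 3, 4, 2]
print(value)   # [9, 5, 25, 3]
```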

In [123]:
# Color Palettes: https://plotly.com/python/discrete-color/

# color for source and target
color1 = px.colors.qualitative.Pastel # Returns list of colors

# Only use colors for source, so I take a slice of the list and add some grey for the targets
color1 = color1[0:(df4['strunz name'].nunique())] + ['rgb(169,169,169)'] * df4['crystal system'].nunique()

# color for links: I want it to be the same as for source
color2 = [color1[label.index(name)] for name in df4['strunz name']]

In [125]:
fig = go.Figure(data=[go.Sankey(
    valuesuffix = " species",
    node = dict(
      pad = 20,
      thickness = 10,
      line = dict(color = "black", width = 0.5),
      label = label,
      color = color1
    ),
    link = dict( 
      source = source, 
      target = target,
      value = value,
      color = color2
  ))])

fig.update_layout(title_text="Strunz Class and Crystal System of Minerals", font_size=10)

# Footnote
fig.add_annotation(text='© Florian Neukirchen 2023 <br>Data: Wikipedia, Creative Commons', 
                    align='left',
                    showarrow=False,
                    xref='paper',
                    yref='paper',
                    x=0,
                    y=-0.22,
                    xanchor='left',
                    yanchor='bottom',
                    )


fig.show()

In [127]:
# Save plot
# fig.write_html("./plots/sankey.html")

## Count the elements in the formula

In [92]:
def unique_elements(s):
    e = set(re.findall('[A-Z][a-z]?', s))
    return ", ".join(sorted(e))

minerals['elements'] = minerals['chemistry'].dropna().apply(unique_elements)
minerals['count elements'] = minerals['elements'].str.count(r"[A-Z]")
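
To see what the regular expression does, the function can be tried on two formulas from the first rows of the dataset. The function is repeated here only so that the example runs on its own.

```python
import re


def unique_elements(formula):
    """Return the sorted, comma-separated element symbols found in a formula."""
    symbols = set(re.findall(r"[A-Z][a-z]?", formula))
    return ", ".join(sorted(symbols))


for formula in ["C31H32N4Ni", "K(UO2)(AsO4)·3H2O"]:
    elements = unique_elements(formula)
    # Every symbol starts with exactly one capital letter, so counting capitals counts symbols.
    n = len(re.findall(r"[A-Z]", elements))
    print(formula, "->", elements, f"({n} elements)")
# C31H32N4Ni -> C, H, N, Ni (4 elements)
# K(UO2)(AsO4)·3H2O -> As, H, K, O, U (5 elements)
```

The pattern accepts any capital letter with an optional lowercase letter, so it does not check whether a match is a real element symbol.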

In [93]:
labels = {'strunz0': 'Strunz class', 'count elements': 'elements in formula', }

fig = px.violin(minerals[['count elements', 'crystal system', 'name', 'chemistry']].dropna(), 
                 x='crystal system', y='count elements', 
                 hover_data=['name', 'chemistry'], labels=labels)
fig.show()

In [94]:
df = minerals[['count elements', 'strunz0', 'name', 'chemistry']].dropna()
df['strunz0'] = df['strunz0'].astype('int')



fig = px.violin(df, x='strunz0', y='count elements', labels=labels, hover_data=['name', 'chemistry'], category_orders=category_orders, height=500)
fig.show()

In [95]:
fig = px.box(df, x='strunz0', y='count elements', labels=labels, hover_data=['name', 'chemistry'], category_orders=category_orders, height=500)
fig.show()

## Mohs scale
Mohs hardness is often given as a range. However, the format on Wikipedia is not uniform: the dataset contains values such as '2', '2.5–3', '4-5', ' 2.0 - 2.5', '3~4' or '3 to 4'. I use a regular expression to extract the minimum and maximum value.

In [96]:
mohs = minerals['mohs'].str.findall(r"([0-9]\.?[0-9]?[0-9]?)")

In [97]:
mohs

0         [2, 3]
1         [4, 5]
2       [2.5, 3]
3            [2]
4            NaN
          ...   
1422      [6, 7]
1423      [3, 4]
1424         [7]
1425         NaN
1426         NaN
Name: mohs, Length: 1427, dtype: object
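
The extraction can be checked against the formats listed above without loading the dataset. This is a sketch under assumptions: the helper `mohs_range` is hypothetical, and its pattern is a variant of the one used in the notebook, not the same expression.

```python
import re

# Assumed pattern: an integer with an optional decimal part.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def mohs_range(text):
    """Return (min, max) of all numbers in a hardness string, or None if there are none."""
    numbers = [float(n) for n in NUMBER.findall(text)]
    if not numbers:
        return None
    return min(numbers), max(numbers)


samples = ["2", "2.5–3", "4-5", " 2.0 - 2.5", "3~4", "3 to 4"]
for s in samples:
    print(repr(s), mohs_range(s))
```

Using `min` and `max` instead of the first and second match also covers values where the larger number happens to come first.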

In [98]:
def mohs_max(l):
    # Missing values arrive as float NaN: pass them through
    if isinstance(l, float):
        return l
    # No number was found in the string
    if len(l) == 0:
        return float('nan')
    # The second number is the upper bound of a range; a single value is its own maximum
    if len(l) > 1:
        return float(l[1])
    return float(l[0])

minerals['mohs_min'] = mohs.str[0].astype('float')
minerals['mohs_max'] = mohs.apply(mohs_max)
# Midpoint of the range
minerals['mohs_avg'] = (minerals['mohs_min'] + minerals['mohs_max']) / 2
minerals['mohs_delta'] = minerals['mohs_max'] - minerals['mohs_min']

In [99]:
minerals[['mohs', 'mohs_min', 'mohs_max', 'mohs_avg', 'mohs_delta']]

Unnamed: 0,mohs,mohs_min,mohs_max,mohs_avg,mohs_delta
0,2–3,2.0,3.0,2.5,1.0
1,4-5,4.0,5.0,4.5,1.0
2,2.5–3,2.5,3.0,2.75,0.5
3,2,2.0,2.0,2.0,0.0
4,,,,,
...,...,...,...,...,...
1422,6 to 7,6.0,7.0,6.5,1.0
1423,3~4,3.0,4.0,3.5,1.0
1424,7,7.0,7.0,7.0,0.0
1425,,,,,


In [100]:
labels['mohs_avg'] = 'Mohs average'
labels['mohs_delta'] = 'Mohs delta'

fig = px.scatter(minerals[['strunz0', 'mohs', 'mohs_min', 'mohs_max', 'mohs_avg', 'mohs_delta', 'name']].dropna(), 
                 x='mohs_avg', y='mohs_delta', 
                 color='strunz0', 
                 hover_data=['name', 'mohs', 'mohs_min', 'mohs_max'], labels=labels)
fig.show()

In [101]:
fig = px.violin(minerals[['strunz0', 'mohs', 'mohs_min', 'mohs_max', 'mohs_avg', 'mohs_delta', 'name']].dropna(), 
                y='mohs_avg', x='strunz0', 
                hover_data=['name', 'mohs', 'mohs_min', 'mohs_max'],
                category_orders=category_orders, labels=labels)
fig.show()

In [102]:
fig = px.violin(minerals[['strunz0', 'mohs', 'mohs_min', 'mohs_max', 'mohs_avg', 'mohs_delta', 'name']].dropna(), 
                y='mohs_delta', x='strunz0', 
                hover_data=['name', 'mohs', 'mohs_min', 'mohs_max'],
                category_orders=category_orders, labels=labels)
fig.show()

In [103]:
fig = px.box(minerals[['strunz0', 'mohs', 'mohs_min', 'mohs_max', 'mohs_avg', 'mohs_delta', 'name']].dropna(), 
                y='mohs_delta', x='strunz0', 
                hover_data=['name', 'mohs', 'mohs_min', 'mohs_max'],
                category_orders=category_orders, labels=labels)
fig.show()