# Purpose
This notebook cleans the MSHA mine dataset. It does not attempt to link MSHA data with the coalmine_eia923 table.

In [1]:
import pandas as pd

zipfile_path = "https://arlweb.msha.gov/OpenGovernmentData/DataSets/Mines.zip"

def extract_msha_data(zipfile_path: str) -> pd.DataFrame:
    """Read the pipe-delimited MSHA mines file and lowercase its column names."""
    mines = pd.read_csv(zipfile_path, sep="|", encoding="latin_1")
    mines.columns = [column.lower() for column in mines.columns]
    return mines

In [2]:
mines = extract_msha_data(zipfile_path)

In [3]:
mines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89781 entries, 0 to 89780
Data columns (total 59 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   mine_id                      89781 non-null  int64  
 1   current_mine_name            89780 non-null  object 
 2   coal_metal_ind               89781 non-null  object 
 3   current_mine_type            89470 non-null  object 
 4   current_mine_status          89781 non-null  object 
 5   current_status_dt            89781 non-null  object 
 6   current_controller_id        88877 non-null  object 
 7   current_controller_name      88877 non-null  object 
 8   current_operator_id          88942 non-null  object 
 9   current_operator_name        89781 non-null  object 
 10  state                        89781 non-null  object 
 11  bom_state_cd                 89781 non-null  int64  
 12  fips_cnty_cd                 89781 non-null  int64  
 13  fips_cnty_nm    

## Correct dtypes

### Correct datetime types
There are a handful of date columns that need to be converted to datetime objects:
- `current_status_dt`
- `current_controller_begin_dt`
- `current_103i_dt`

In [4]:
date_fields = ["current_status_dt", "current_controller_begin_dt", "current_103i_dt"]
for field in date_fields:
    mines[field] = pd.to_datetime(mines[field])
    
date_field_rename_map = {field_name: field_name[:-2] + "date" for field_name in date_fields}
mines = mines.rename(columns=date_field_rename_map)
mines[date_field_rename_map.values()].head()

Unnamed: 0,current_status_date,current_controller_begin_date,current_103i_date
0,1979-01-22,1989-07-01,NaT
1,2003-03-04,2000-06-14,NaT
2,1989-08-15,1989-07-31,NaT
3,1976-09-24,2002-01-17,NaT
4,1975-11-14,1950-01-01,NaT


### Correct integer types
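Some integer columns are read as floats because pandas' default integer dtype cannot represent missing values, so a numeric column containing NaN is upcast to float. Casting these columns to the nullable `Int64` dtype restores integer values while keeping the missing entries.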

In [5]:
int_fields = [
    'days_per_week',
    'hours_per_shift',
    'prod_shifts_per_day',
    'maint_shifts_per_day',
    'no_employees',
    'avg_mine_height',
    'methane_liberation',
    'no_producing_pits',
    'no_nonproducing_pits',
    'no_tailing_ponds',
    'miles_from_office'
]

for field in int_fields:
    mines[field] = mines[field].astype("Int64")
    
mines[int_fields].dtypes

days_per_week           Int64
hours_per_shift         Int64
prod_shifts_per_day     Int64
maint_shifts_per_day    Int64
no_employees            Int64
avg_mine_height         Int64
methane_liberation      Int64
no_producing_pits       Int64
no_nonproducing_pits    Int64
no_tailing_ponds        Int64
miles_from_office       Int64
dtype: object
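
To make the upcasting behaviour concrete, here is a minimal sketch on a toy column. The values are illustrative and not taken from the MSHA file. Note that the cast is expected to fail if a column holds non-integer floats, so it is only appropriate for columns that are integers in practice.

```python
import pandas as pd

# Toy column with one missing value (illustrative data, not from the MSHA file).
raw = pd.Series([5, None, 7], name="days_per_week")
print(raw.dtype)  # the missing value forces a float representation

# Casting to the nullable "Int64" extension dtype keeps whole numbers
# as integers and marks the gap as <NA> instead of NaN.
nullable = raw.astype("Int64")
print(nullable.dtype)
print(nullable.isna().sum())  # the missing entry is preserved
```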

## More mine table cleaning

### Inspect missing values

In [6]:
mines.isna().sum()

mine_id                              0
current_mine_name                    1
coal_metal_ind                       0
current_mine_type                  311
current_mine_status                  0
current_status_date                  0
current_controller_id              904
current_controller_name            904
current_operator_id                839
current_operator_name                0
state                                0
bom_state_cd                         0
fips_cnty_cd                         0
fips_cnty_nm                         0
cong_dist_cd                     21198
company_type                       839
current_controller_begin_date      839
district                             0
office_cd                            0
office_name                          0
assess_ctrl_no                   53684
primary_sic_cd                     467
primary_sic                        467
primary_sic_cd_1                   467
primary_sic_cd_sfx                 467
secondary_sic_cd         

#### `current_mine_type`

In [7]:
mines.current_mine_status.value_counts(normalize=True)

Abandoned               0.744255
Abandoned and Sealed    0.097393
Active                  0.071106
Intermittent            0.062708
Temporarily Idled       0.013600
New Mine                0.006750
NonProducing            0.004188
Name: current_mine_status, dtype: float64

In [8]:
mines[mines.current_mine_type.isna()].current_mine_status.value_counts(normalize=True)

New Mine     0.794212
Abandoned    0.202572
Active       0.003215
Name: current_mine_status, dtype: float64

A few hundred mines are missing `current_mine_type`. A disproportionate number of them are new mines (about 79%, compared with under 1% of all mines), so maybe mines don't receive a type until they are active?

#### `current_controller_id`

In [9]:
print(mines[mines.current_controller_id.isna()][["current_controller_name"]].isna().value_counts())
print()
print(mines[mines.current_controller_id.isna()][["current_operator_id"]].isna().value_counts())
print()
print(mines[mines.current_controller_id.isna()][["current_operator_name"]].isna().value_counts())

current_controller_name
True                       904
dtype: int64

current_operator_id
True                   839
False                   65
dtype: int64

current_operator_name
False                    904
dtype: int64


All 904 mines missing controller ids are also missing controller names. 839 of these 904 mines are also missing operator ids, but all of them have operator names. Maybe these mines were last updated when MSHA only reported operator names?

#### `current_operator_id`

In [10]:
print(mines.current_operator_id.isna().value_counts())
print()
print(mines.current_operator_name.isna().value_counts())

False    88942
True       839
Name: current_operator_id, dtype: int64

False    89781
Name: current_operator_name, dtype: int64


It looks like the 839 mines without an operator_id also don't have a controller_id. However, every mine has an operator name. I could use an entity resolution tool to assign operator ids to the records with missing ids using the current_operator_name.
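
Before reaching for a full entity resolution tool, exact matching on the operator name is a cheap first pass. The sketch below is an assumption about how that could look, not something run against the real data: `impute_operator_ids` is a hypothetical helper, and it only fills an id when the name maps to exactly one known id, leaving ambiguous or unmatched names untouched.

```python
import pandas as pd


def impute_operator_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing operator ids when the operator name maps to exactly one known id."""
    known = df.dropna(subset=["current_operator_id"])

    # Count distinct ids per name and keep only names with a single id.
    ids_per_name = known.groupby("current_operator_name")["current_operator_id"].nunique()
    unambiguous_names = ids_per_name[ids_per_name == 1].index

    # Build a name -> id lookup restricted to the unambiguous names.
    name_to_id = (
        known[known["current_operator_name"].isin(unambiguous_names)]
        .drop_duplicates("current_operator_name")
        .set_index("current_operator_name")["current_operator_id"]
    )

    out = df.copy()
    out["current_operator_id"] = out["current_operator_id"].fillna(
        out["current_operator_name"].map(name_to_id)
    )
    return out


# Toy data (illustrative only): one fillable gap, one ambiguous name, one unknown name.
toy = pd.DataFrame(
    {
        "current_operator_id": ["A1", None, "B1", "B2", None, None],
        "current_operator_name": [
            "Acme Coal", "Acme Coal", "B & B Coal", "B & B Coal", "B & B Coal", "Unknown Co",
        ],
    }
)
imputed = impute_operator_ids(toy)
print(imputed)
```

Skipping ambiguous names is a deliberate choice here: generic names such as "B & B Coal Company" appear under many ids, so an exact name match alone is weak evidence that two records are the same operator.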

#### `cong_dist_cd`

In [11]:
print(mines[mines.cong_dist_cd.isna()]["longitude"].isna().value_counts())
print()
print(mines[mines.cong_dist_cd.isna()].nearest_town.isna().value_counts())

False    12022
True      9176
Name: longitude, dtype: int64

False    20516
True       682
Name: nearest_town, dtype: int64


Almost a quarter of mines (21,198 of 89,781) are missing congressional district codes. However, many of these records have geographic information like longitude, latitude and nearest town that could be used to impute congressional district codes: 12,022 have a longitude and 20,516 have a nearest town.

## Data Structure
Is there a unique identifier for the data? Is there data that can be normalized?

### Unique ID

In [12]:
mines.mine_id.is_unique

True

Amazing! The `mine_id` is a unique identifier for the data. Are the mine entities actually unique though?

In [13]:
mine_fields = mines.columns.to_list()
mine_fields.remove("mine_id")
duplicate_mines = mines[mine_fields].duplicated(keep=False)
duplicate_mines.value_counts()

False    89760
True        21
dtype: int64

Unfortunately, 21 records are exact duplicates of another record in every field except `mine_id`, so a few mines appear to have multiple MSHA ids. I'm hesitant to drop the duplicates because I don't know which MSHA ids are used in other datasets. A possible solution would be to remove duplicate MSHA ids in this dataset that do not appear in other MSHA datasets.
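
One way the proposed solution could be sketched is shown below. Everything here is hypothetical: `drop_unreferenced_duplicates` is an illustrative helper, and `referenced_ids` stands in for the set of mine ids found in other MSHA datasets, which this notebook has not loaded. When no id in a duplicate group is referenced elsewhere, the sketch keeps the first record so that the mine itself is not lost.

```python
import pandas as pd


def drop_unreferenced_duplicates(df: pd.DataFrame, referenced_ids: set) -> pd.DataFrame:
    """Drop duplicate mine records whose mine_id is not referenced elsewhere."""
    value_cols = [col for col in df.columns if col != "mine_id"]

    # Put referenced rows first (stable sort keeps the original order otherwise),
    # so the first row of each duplicate group is a referenced one when possible.
    ordered = df.assign(_referenced=df["mine_id"].isin(referenced_ids)).sort_values(
        "_referenced", ascending=False, kind="stable"
    )
    first_of_group = ~ordered.duplicated(subset=value_cols, keep="first")

    # Keep every referenced row, plus one representative per group.
    keep = ordered["_referenced"] | first_of_group
    return ordered[keep].drop(columns="_referenced").sort_index()


# Toy data (illustrative only): ids 1 and 2 are duplicates, as are 4 and 5.
toy_mines = pd.DataFrame(
    {
        "mine_id": [1, 2, 3, 4, 5],
        "current_mine_name": ["A", "A", "B", "C", "C"],
    }
)
deduplicated = drop_unreferenced_duplicates(toy_mines, referenced_ids={2})
print(deduplicated)
```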

### Controller and Operators
It looks like there are entities called controllers and operators that might be associated with multiple mines. Let's pull them out into separate tables.

In [14]:
mines.groupby("current_controller_id").mine_id.count().describe()

count    40269.000000
mean         2.207082
std          8.282999
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max        951.000000
Name: mine_id, dtype: float64

In [15]:
mines.groupby("current_operator_id").mine_id.count().describe()

count    48977.000000
mean         1.815995
std          3.553414
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max        285.000000
Name: mine_id, dtype: float64

It looks like controllers and operators can be associated with multiple mines.

#### Controllers

In [16]:
print(mines.current_controller_id.isna().value_counts())
print()
print(mines.current_controller_name.isna().value_counts())

False    88877
True       904
Name: current_controller_id, dtype: int64

False    88877
True       904
Name: current_controller_name, dtype: int64


There are 904 mines that have neither a controller name nor a controller id. Since no record has a controller name without an id, it is safe to create a separate controller table.

In [17]:
controller_field_rename_map = {"current_controller_id": "controller_id", "current_controller_name": "controller_name"}

controllers = mines[controller_field_rename_map.keys()].copy()
controllers = controllers.drop_duplicates()

mines = mines.drop(columns="current_controller_name")
controllers = controllers.rename(columns=controller_field_rename_map)

In [18]:
controller_name_counts = controllers.controller_name.value_counts()
print(controller_name_counts.describe())
controller_name_counts.head(10)

count    39941.000000
mean         1.009164
std          0.103838
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          4.000000
Name: controller_name, dtype: float64


Bennett James               4
Smith David                 4
Rock Products Inc           4
Sterling Coal Company       4
Sharp Charles               3
Meade Coal Company          3
Bear Branch Coal Company    3
Royal Coal Company          3
Betsy Kay Coal Co           3
Wright Larry                3
Name: controller_name, dtype: int64

Most controller ids are associated with unique controller names; however, there are a few controller names that have multiple ids. We could pick a canonical controller id for each controller name and propagate the changes in the mines table.
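
A minimal sketch of the canonical id idea follows, on toy tables rather than the real data. It assumes that identical names refer to the same entity and that the smallest id per name can serve as the canonical one. Both are assumptions: names such as "Smith David" could belong to different people, so this should be checked before applying it to the real tables.

```python
import pandas as pd

# Toy tables (illustrative ids, not real MSHA controller ids).
toy_controllers = pd.DataFrame(
    {
        "controller_id": ["C002", "C001", "C003"],
        "controller_name": ["Smith David", "Smith David", "Rock Products Inc"],
    }
)
toy_mines = pd.DataFrame(
    {
        "mine_id": [1, 2, 3],
        "current_controller_id": ["C001", "C002", "C003"],
    }
)

# Assumption: the smallest id within each name group is the canonical id.
canonical = toy_controllers.groupby("controller_name")["controller_id"].transform("min")
id_to_canonical = dict(zip(toy_controllers["controller_id"], canonical))

# Propagate the canonical ids into the mines table.
# Ids absent from the lookup (including missing ids) would become NaN here.
toy_mines["current_controller_id"] = toy_mines["current_controller_id"].map(id_to_canonical)
print(toy_mines)
```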

#### Operators

In [19]:
print(mines.current_operator_id.isna().value_counts())
print()
print(mines.current_operator_name.isna().value_counts())

False    88942
True       839
Name: current_operator_id, dtype: int64

False    89781
Name: current_operator_name, dtype: int64


839 mines are missing operator ids but none are missing operator names. Therefore, creating a separate operators table and dropping the operator name from the mines table would result in a loss of information. I'm still going to create an operators data frame to better understand the missing and duplicate data.

In [20]:
operator_field_rename_map = {"current_operator_id": "operator_id", "current_operator_name": "operator_name"}

operators = mines[operator_field_rename_map.keys()].copy()
operators = operators.drop_duplicates()

# mines = mines.drop(columns="current_operator_name")
operators = operators.rename(columns=operator_field_rename_map)

In [21]:
operator_name_counts = operators.operator_name.value_counts()
print(operator_name_counts.describe())
operator_name_counts.head(10)

count    51870.000000
mean         1.061307
std          0.380891
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         20.000000
Name: operator_name, dtype: float64


Ligon Preparation Company    20
B & B Coal Company           20
Smith Coal Company           16
M & M Coal Company           15
C & C Coal Company           14
H & H Coal Company           12
B & M Coal Company           12
C & H Coal Company           11
R & R Coal Company           10
C & S Coal Company            9
Name: operator_name, dtype: int64

Like controllers, operator ids are mostly associated with unique operator names but there are some operator names associated with multiple operator ids.

Some additional cleaning I could do with these operator and controller names:
- Pick a canonical ID for each name and propagate the new ID into the mines table. This would allow us to see all mines operated by a single operator using operator ID.
- Standardize company strings (LLC, Llc, LLC., INC., Inc, inc) and ensure company names are in Title Case.
- Better understand the differences between operators and controllers. Is there overlap between the entities? Can we create an additional table of mining companies so we can find the mines a company operates and controls with a single ID?
- Find additional duplicate operators and controllers using an entity resolution tool. 
- Impute missing operator ids using an entity resolution tool or exact matching.
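
The string standardization item in the list above could start from something like the sketch below. The `SUFFIXES` mapping and the `standardize_company_name` helper are hypothetical, and the list of suffixes is an assumption that would need to be extended after inspecting the real names. `str.title` has known quirks (for example around apostrophes), so the output should be spot-checked.

```python
import pandas as pd

# Assumed canonical spellings for common company suffixes (extend as needed).
SUFFIXES = {"llc": "LLC", "inc": "Inc", "co": "Co", "corp": "Corp"}


def standardize_company_name(name: str) -> str:
    """Title-case a company name and normalize common suffix spellings."""
    cleaned = []
    for word in name.replace(",", " ").split():
        # Compare suffixes case-insensitively and without a trailing period.
        key = word.rstrip(".").lower()
        cleaned.append(SUFFIXES.get(key, word.title()))
    return " ".join(cleaned)


# Toy names (illustrative only); na_action="ignore" leaves missing names untouched.
names = pd.Series(["ACME COAL LLC.", "rock products, inc", "B & B Coal Co.", None])
standardized = names.map(standardize_company_name, na_action="ignore")
print(standardized)
```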

#### `primary_sic_cd`
The Standard Industrial Classification (SIC) code for the primary commodity at each mine could be stored in a separate table.

In [22]:
mines["primary_sic_cd"] = mines["primary_sic_cd"].astype("Int64")

In [23]:
print(mines.primary_sic.nunique())
print(mines.primary_sic_cd.nunique())
print()
print(mines.primary_sic.value_counts().head(10))

120
120

Coal (Bituminous)                33711
Construction Sand and Gravel     28266
Crushed, Broken Limestone NEC     6489
Gold Ore                          2984
Crushed, Broken Stone NEC         2720
Dimension Stone NEC               1579
Coal (Anthracite)                 1546
Crushed, Broken Traprock          1151
Sand, Common                       959
Common Clays NEC                   744
Name: primary_sic, dtype: int64


In [24]:
print(mines.primary_sic_cd.isna().value_counts())
print()
print(mines.primary_sic.isna().value_counts())
print()

# Are there any records that have a primary_sic description but no code? 
(~mines.primary_sic.isna() & mines.primary_sic_cd.isna()).any()

False    89314
True       467
Name: primary_sic_cd, dtype: int64

False    89314
True       467
Name: primary_sic, dtype: int64



False

There are a few hundred records missing primary_sic information, but no record has a primary_sic description without a code, which means it's safe to extract a separate table.

In [25]:
sic_field_rename_map = {"primary_sic_cd": "primary_sic_code", "primary_sic": "primary_sic_desc"}

primary_sic = mines[sic_field_rename_map.keys()].copy()
primary_sic = primary_sic.drop_duplicates()

mines = mines.drop(columns="primary_sic")
primary_sic = primary_sic.rename(columns=sic_field_rename_map)

In [26]:
assert primary_sic.primary_sic_desc.is_unique

Amazing. The primary sic descriptions are actually unique.