## Importing Data from Google Drive to Colab Environment

This notebook illustrates how to access data in your Google Drive account from Colab, import the data, and do some preliminary data cleaning before using it in analysis.

Click the badge below to open in Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/chuckgrigsby0/agec-784/blob/main/notebooks/01_load_data_into_colab_csv.ipynb)

The following code block mounts your Google Drive account, giving you access to your files saved in `MyDrive`.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Next, we import `pandas` and `numpy` and read the CSV file into a DataFrame. This assumes your data is saved in the `Data` folder within `MyDrive`.

In [None]:
import pandas as pd
import numpy as np

# Use the absolute path so the read works regardless of the current working directory
df = pd.read_csv('/content/drive/MyDrive/Data/corn_production_by_state_2022_2017.csv')
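
A minimal sketch of one way to build file paths once and reuse them, assuming Drive is mounted at `/content/drive` and the data sits in `MyDrive/Data`. The names `DRIVE_ROOT`, `DATA_DIR`, and `data_file` are hypothetical helpers introduced here for illustration.

```python
from pathlib import PurePosixPath

# Assumed layout: Drive mounted at /content/drive, data stored in MyDrive/Data
DRIVE_ROOT = PurePosixPath("/content/drive/MyDrive")
DATA_DIR = DRIVE_ROOT / "Data"


def data_file(filename: str) -> str:
    """Return the absolute path of a file in the (assumed) Data folder."""
    return str(DATA_DIR / filename)


csv_path = data_file("corn_production_by_state_2022_2017.csv")
print(csv_path)
```

The resulting string can be passed to `pd.read_csv()` or `DataFrame.to_csv()`, so reading and writing point at the same folder.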

The next lines of code show several useful attributes and methods to remember for better understanding the properties of your data.

In [None]:
# Column names and a preview of the first five rows
print(f"Column names: {df.columns.to_list()}")
print(f"First five rows:\n{df.head()}")
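
A few other attributes and methods are often useful at this stage. The sketch below uses a small made-up DataFrame (`toy`) in place of the NASS export, so the values are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for the real data
toy = pd.DataFrame({
    "Year": [2022, 2022, 2017],
    "State": ["IOWA", "KANSAS", "IOWA"],
    "Value": ["1,200", " (D)", "950"],
})

n_rows, n_cols = toy.shape               # number of rows and columns
missing_per_column = toy.isna().sum()    # count of missing values in each column

# Value holds text here, so pandas does not treat it as numeric yet
value_is_numeric = pd.api.types.is_numeric_dtype(toy["Value"])

print(f"Shape: {toy.shape}")
print(f"Data types:\n{toy.dtypes}")
print(f"Missing values per column:\n{missing_per_column}")
print(f"Is Value numeric? {value_is_numeric}")
```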

### Clean `Value` Column

In most cases, when you download USDA NASS data, the `Value` column containing the variable of interest needs to be cleaned before it can be used in analysis. The following code uses [regular expressions (regex)](https://en.wikipedia.org/wiki/Regular_expression) to remove rows whose `Value` is "(D)", a flag indicating that the value is withheld to avoid disclosing data for individual farms, or "(Z)", a flag indicating that the value is less than half of the unit shown.

We also need to convert the `Value` column to a `float` data type, as it is formatted as a string when we initially import it.

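The first cell below checks whether any entry in `Value` contains "(D)" or "(Z)". The second cell creates a boolean (True/False) vector, `mask`, that is `True` for rows where `Value` consists only of one of these flags, allowing for surrounding whitespace.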

In [None]:
df['Value'].astype(str).str.contains(r'\((?:D|Z)\)', regex=True, na=False).any()

In [None]:
mask = df['Value'].astype(str).str.contains(r'^\s*\((?:D|Z)\)\s*$', regex=True, na=False)

Because we want to keep rows *not* containing "(D)" or "(Z)", we use the `~` operator to invert the boolean mask. This converts `True` to `False` and `False` to `True`, so rows that matched the pattern (originally `True`) become `False` and are filtered out.

In [None]:
# Keep rows that do not match; .copy() avoids a SettingWithCopyWarning in later assignments
df = df[~mask].copy()
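
The two patterns used in this section behave differently, which the following sketch illustrates on made-up values. The unanchored pattern matches a flag anywhere in the string, while the anchored pattern (with `^` and `$`) matches only when the flag is the entire entry.

```python
import pandas as pd

# Made-up entries for illustration; real NASS exports may differ
values = pd.Series(["1,200", " (D)", "(Z)", "950", "12 (D) note"])

loose = r"\((?:D|Z)\)"             # flag anywhere in the string
strict = r"^\s*\((?:D|Z)\)\s*$"    # flag is the whole entry, ignoring whitespace

loose_mask = values.str.contains(loose, regex=True, na=False)
strict_mask = values.str.contains(strict, regex=True, na=False)

# ~ flips True and False, so only the non-matching entries are kept
kept = values[~strict_mask]

print(loose_mask.tolist())
print(strict_mask.tolist())
print(kept.tolist())
```

In this toy example the entry `"12 (D) note"` survives the anchored filter but would still be detected by the unanchored check, so the two patterns can disagree when a flag is embedded in longer text.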

In [None]:
# Verify that '(D)' and '(Z)' values have been removed
df['Value'].astype(str).str.contains(r'\((?:D|Z)\)', regex=True, na=False).any()

In [None]:
# Check the data type of the 'Value' column
df['Value'].dtype  # 'O' (object) usually means the column holds strings

If the `Value` column contains thousands separators (`,`), we need to remove them before converting `Value` to a numeric type.

In [None]:
# Remove ',' from `Value` column
df['Value'] = df['Value'].astype(str).str.replace(',', '', regex=False)

Lastly, we convert `Value` from a string to a numeric type using the `pandas` function `to_numeric()`. With `errors='coerce'`, any entry that cannot be parsed becomes `NaN`, so we then count the missing values and drop them if needed so that the `Value` column is clean for analysis.

In [None]:
# Convert to numeric
df['Value'] = pd.to_numeric(df['Value'], errors='coerce')

In [None]:
# Check for NAs
df['Value'].isna().sum()
# df.dropna(subset=['Value'], inplace=True) # Drop NAs if needed
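
The comma removal, numeric conversion, and missing-value check can be sketched end to end on a small made-up Series. The entry `"n/a"` is a hypothetical unparseable value, included to show what `errors='coerce'` does.

```python
import pandas as pd

# Made-up string values; "n/a" stands in for anything that is not a number
raw = pd.Series(["1,200", "950", "3,400.5", "n/a"])

no_commas = raw.str.replace(",", "", regex=False)    # "1,200" -> "1200"
numeric = pd.to_numeric(no_commas, errors="coerce")  # unparseable entries -> NaN

n_missing = int(numeric.isna().sum())  # how many entries failed to convert
clean = numeric.dropna()               # drop them once you have inspected them

print(numeric.tolist())
print(f"Missing after conversion: {n_missing}")
```

Counting the missing values before dropping them is a cheap way to notice when the conversion discarded more rows than expected.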

In [None]:
# Show all numerical variables with 2 decimal places, no scientific notation
pd.set_option('display.float_format', lambda x: f'{x:.2f}')
df['Value'].describe()

In [None]:
# Unique years in the data
df['Year'].unique()

In [None]:
# Unique variable types
df['Data Item'].unique()

In [None]:
# Unique states
df['State'].unique()

In [None]:
# f-strings let you embed values inside text by wrapping them in {}
# for more descriptive output.
print(f"Unique states in data include: {df['State'].unique()}")
print(f"Unique years in data include: {df['Year'].unique()}")
print(f"Unique data items in data include: {df['Data Item'].unique()}")

In [None]:
# Keep only the rows for 2022
df_filter = df[df['Year'] == 2022]

In [None]:
df_filter['Year'].unique()
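
Filters like this can be extended to several values or several conditions. The following is a minimal sketch on made-up data; the column names mirror the ones used in this notebook, but the rows are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2022, 2017, 2022, 2012],
    "State": ["IOWA", "IOWA", "KANSAS", "KANSAS"],
    "Value": [1200.0, 950.0, 400.0, 380.0],
})

# Keep rows from either year; .copy() makes the result independent of `toy`
both_years = toy[toy["Year"].isin([2017, 2022])].copy()

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
iowa_2022 = toy[(toy["Year"] == 2022) & (toy["State"] == "IOWA")]

print(both_years)
print(iowa_2022)
```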

In [None]:
# Mean and standard deviation of Value for each State and Data Item combination
desc_stats = df.groupby(['State', 'Data Item']).agg({'Value': ['mean', 'std']})

In [None]:
print(desc_stats)
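
Passing a dictionary to `.agg()` produces two-level column names such as `('Value', 'mean')`. As an alternative sketch, named aggregation yields flat column names, and `reset_index()` turns the group labels back into ordinary columns. The data and the names `mean_value` and `std_value` below are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["IOWA", "IOWA", "KANSAS", "KANSAS"],
    "Data Item": ["ACRES", "ACRES", "ACRES", "ACRES"],
    "Value": [100.0, 300.0, 50.0, 150.0],
})

stats = (
    toy.groupby(["State", "Data Item"])
       .agg(mean_value=("Value", "mean"), std_value=("Value", "std"))
       .reset_index()  # move State and Data Item from the index into columns
)

print(stats)
```

The flat layout is often easier to filter, merge, and export than the two-level one, at the cost of choosing the output names yourself.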

In [None]:
filename = 'desc_stats_corn_by_state_and_data_item.csv'
# Keep the index when saving: it holds the State and Data Item group labels
desc_stats.to_csv(f"/content/drive/MyDrive/Data/{filename}")
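
After a `groupby`, the group labels live in the index of the result, so whether `to_csv()` writes the index determines whether those labels reach the file. The sketch below shows the difference on made-up data, writing to strings in memory rather than to Drive.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "State": ["IOWA", "IOWA", "KANSAS"],
    "Value": [100.0, 300.0, 50.0],
})
by_state = toy.groupby("State").agg(mean_value=("Value", "mean"))

# index=False drops the index, and with it the State labels
without_index = by_state.to_csv(index=False)

# The default (index=True) writes the State labels as the first column
with_index = by_state.to_csv()

# Reading the text back confirms that the labels survived
round_trip = pd.read_csv(io.StringIO(with_index))

print(without_index)
print(with_index)
```

With no file path given, `to_csv()` returns the CSV text as a string, which makes it easy to inspect what would be written before saving anything.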