# #1. Get Cannabis Data

Hello everyone, my name is Keegan and it's awesome to meet each of you. This group is for people who want to apply data science to cannabis data. My background is in economics and lab work, which gives me an interesting perspective and an abundant interest in the subject, so I wanted to host a group. I can share some of my work, but this group can go wherever the group decides, so please feel free to speak and share. We'll take it from there.

## What does a data scientist do?
A senior analyst may spend the bulk of their time cleaning data to be used in prediction models. The data is presented in a final report in a simple format. Data sets may range from thousands to billions of observations.

1.	The first step is to find or build a dictionary of the data.
2.	Next, you need to import all of the data so that you can work with it.
3.	Now, you will need to begin to clean the data. This requires an understanding of the structure of the data.
4.	Once the data is defined, read, and formatted, then you can begin analysis. This is typically one of the faster stages if data is cleaned and formatted well. Data is then analyzed with mathematical and statistical functions, such as regression analysis, to make comparisons and gain insights.
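
The four steps above can be sketched on a toy dataset. This is only an illustration: the column names, descriptions, and values are hypothetical, and an in-memory CSV stands in for a real file.

```python
import io

import pandas as pd

# 1. A data dictionary: what each column means and its expected type.
#    These column names and descriptions are hypothetical.
data_dictionary = {
    'sample_id': {'description': 'Unique sample identifier', 'dtype': 'string'},
    'sample_type': {'description': 'Type of product tested', 'dtype': 'string'},
    'thc_percent': {'description': 'THC by weight', 'dtype': 'float'},
}

# 2. Import the data (an in-memory CSV stands in for a real file).
raw = io.StringIO(
    'sample_id,sample_type,thc_percent\n'
    'A1,flower,21.5\n'
    'A2,flower,not tested\n'
    'A3,concentrate,68.0\n'
    'A4,flower,18.5\n'
)
df = pd.read_csv(raw)

# 3. Clean: coerce the numeric column, then drop rows that fail to parse.
df['thc_percent'] = pd.to_numeric(df['thc_percent'], errors='coerce')
clean = df.dropna(subset=['thc_percent'])

# 4. Analyze: average THC by sample type.
averages = clean.groupby('sample_type')['thc_percent'].mean()
print(averages)
```

The cleaning step is where the data dictionary pays off: knowing that `thc_percent` should be numeric is what tells you the `not tested` entry needs handling.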

The first rule of data science: look at the data. Next, you need to formulate a research question. Then you can begin to think about how to clean the data and what variables to keep. Data scientists may use time series regressions. You can find interesting points in time and compare the series before and after the event to try to identify any structural breaks. You can control for region, such as urban or rural, and other geographical variables. It is often incredibly fruitful to combine multiple datasets.
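
As a rough sketch of combining datasets and comparing a series before and after an event, the example below merges a hypothetical region table onto hypothetical observations and compares average values by period. The licensees, prices, and event date are made up, and this is a descriptive comparison, not a formal structural break test.

```python
import pandas as pd

# Hypothetical observations for two licensees.
observations = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-05', '2021-01-06', '2021-01-12', '2021-01-13']),
    'licensee': ['A', 'B', 'A', 'B'],
    'price': [10.0, 12.0, 13.0, 15.0],
})

# Hypothetical second dataset with a geographic variable.
regions = pd.DataFrame({'licensee': ['A', 'B'], 'region': ['urban', 'rural']})

# Combine the two datasets on their shared key.
combined = observations.merge(regions, on='licensee', how='left')

# Label each observation as before or after the (hypothetical) event.
event_date = pd.Timestamp('2021-01-10')
combined['period'] = combined['date'].ge(event_date).map({False: 'before', True: 'after'})

# Average price by region and period, then the change across the event.
comparison = combined.groupby(['region', 'period'])['price'].mean().unstack('period')
change = comparison['after'] - comparison['before']
print(change)
```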

## Reading the Data
Examples of ways to work with Washington State cannabis traceability data.
<!--The first step of the pipeline with any data science related tutorial is usually the data loading component. Besides visually describing the dataset in use to your audience, also try to briefly explain (in one or two sentences) where the data came from, i.e., the source of the data. Other specifications like dimensions and attribute type are important but can be neatly explained with examples using code and tools such as `pandas`.-->

In [None]:
# Standard imports.
import datetime

# External imports.
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels.graphics.regressionplots import abline_plot


# Read in lab results from the .csv file obtained through public
# records request.
file_name = 'LabResults_0.csv'
data = pd.read_csv(
    file_name,
    encoding='utf-16',
    sep='\t',
)
print('Number of observations:', len(data))
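
If you do not have the lab results file on hand, the sketch below shows why the `encoding` and `sep` arguments matter. It writes a tiny tab-separated, UTF-16 encoded file to a temporary folder and reads it back with the same options. The file name and row values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# A tiny tab-separated table. The IDs and values are hypothetical.
rows = (
    'global_id\tintermediate_type\n'
    'WAL1.ABC\tflower_lots\n'
    'WAL2.DEF\tco2_concentrate\n'
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'example.csv')

    # Write the file as UTF-16, the encoding used when reading the lab results.
    with open(path, 'w', encoding='utf-16', newline='') as f:
        f.write(rows)

    # Read it back with the same options used for the lab results.
    example = pd.read_csv(path, encoding='utf-16', sep='\t')

print(example.shape)
print(list(example.columns))
```

Despite the `.csv` extension, the fields are separated by tabs, so `sep='\t'` is needed for `pandas` to split the columns correctly.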

## Exploring the Data
Since you are teaching through writing and not actually live coding, resist the temptation to write code that transforms the data or engineers features before actually exploring it. It's a common mistake that should be minimized. You want to give the readers some idea about the data through basic statistics, plots, and figures. Practise this as much as you can, and it will become an important habit in your data science workflow. Your readers will also appreciate the courtesy.

In [None]:
# Count the number of observations.
print('Number of observations:', len(data))

# Look at the data
obs = data.iloc[0].to_dict()
print(list(data.columns))

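# Take the last LIMIT observations as a working sample.
# Copy so that new columns can be added without a SettingWithCopyWarning.
LIMIT = 10000
sample = data[-LIMIT:].copy()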
print(list(sample.iloc[0].keys()))
print(list(sample['intermediate_type'].unique()))

## Preprocessing the Data
Although sometimes not necessary, as some datasets already come preprocessed, I believe it is important to briefly mention what preprocessing steps the data has undergone -- even if you need to do this through code examples. It should clear up any confusion that might arise during the modeling section of the tutorial. Remember, your audience wants to get a broad understanding of the data before the modeling component of the tutorial, so try to explain this part of the tutorial as clearly as possible with examples. Take advantage of your notebook features and other tools such as `matplotlib` and `pandas`.

In [None]:
# Create some variables
sample['time'] = pd.to_datetime(sample['tested_at'])
sample['lab'] = sample['global_id'].str.slice(2, 4)
print('Number of labs:', len(sample['lab'].unique()))

# Select flower lots.
flower = sample.loc[sample['intermediate_type'] == 'flower_lots']

# Select high-THC flower: THCA above 20% and at most 35%, with CBDA below 0.5%.
high_thc = flower.loc[
    (flower['cannabinoid_d9_thca_percent'] <= 35) &
    (flower['cannabinoid_d9_thca_percent'] > 20) &
    (flower['cannabinoid_cbda_percent'] < 0.5)
]

# Select high-CBD flower (CBDA above 5%) and plot its water activity rate.
high_cbd = flower.loc[flower['cannabinoid_cbda_percent'] > 5]
high_cbd['moisture_content_water_activity_rate'].plot()

# Define independent and dependent variables for a regression.
X = high_thc[['cannabinoid_cbda_percent',]].fillna(0)
Y = high_thc['cannabinoid_d9_thca_percent'].fillna(0)
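
To see what the preprocessing steps do, here is a small sketch on hypothetical rows. The ID format below is an assumption made for illustration, so check the real `global_id` values before relying on the slice positions.

```python
import pandas as pd

# Hypothetical rows. The ID format is an assumption for illustration.
toy = pd.DataFrame({
    'global_id': ['WAL1.AAA', 'WAL1.BBB', 'WAL2.CCC'],
    'tested_at': ['2020-12-01 09:30:00', '2020-12-01 14:00:00', '2020-12-02 10:15:00'],
    'cannabinoid_d9_thca_percent': [24.0, 12.0, 31.0],
})

# Parse the timestamp strings into datetimes.
toy['time'] = pd.to_datetime(toy['tested_at'])

# Take the characters at positions 2 and 3 of the ID as a lab code.
toy['lab'] = toy['global_id'].str.slice(2, 4)
print('Number of labs:', toy['lab'].nunique())

# Combine conditions with & and wrap each one in parentheses.
in_range = toy.loc[
    (toy['cannabinoid_d9_thca_percent'] > 20) &
    (toy['cannabinoid_d9_thca_percent'] <= 35)
]
print(in_range[['global_id', 'lab', 'time']])
```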

## Modeling the Data
If you are using tools such as PyTorch or TensorFlow for your data science projects, this section is reserved for the computation graph. Here you usually just state very briefly what you are building. No need to go into details just yet!

In [None]:
# Fit a regression model.
X = sm.add_constant(X)
model = sm.OLS(Y, X)
regression_results = model.fit()
print(regression_results.summary())

# Plot the regression
ax = high_thc.plot(
    x='cannabinoid_cbda_percent',
    y='cannabinoid_d9_thca_percent',
    kind='scatter'
)
abline_plot(model_results=regression_results, ax=ax)
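
For intuition about what the regression estimates, the sketch below computes the same kind of least-squares intercept and slope with `numpy` on made-up numbers. It is not a replacement for the `statsmodels` summary, which also reports standard errors and other diagnostics.

```python
import numpy as np

# Made-up data that lies exactly on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Add a column of ones so the model has an intercept,
# which is the role a constant column plays in the design matrix.
X = np.column_stack([np.ones_like(x), x])

# Solve for the coefficients that minimize the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta
print(f'intercept={intercept:.2f}, slope={slope:.2f}')
```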

## Testing the Model
One of the things I have learned over the years is that everything in data science is better understood with examples, rather than just using plain code or pictures. Before you begin training your models, make sure to explain to the reader what the model expects as input and what it is expected to output. Rendering code here with nice descriptions helps to prepare the reader for what to expect when training the model, especially since the training code is usually longer than most sections of the tutorial. With libraries like [PyTorch](https://pytorch.org/) and [DyNet](http://dynet.io/) this is fairly easy since they are dynamic computing libraries. TensorFlow also offers [eager](https://www.tensorflow.org/guide/eager) execution to evaluate operations immediately. In TensorFlow 1.x it is enabled with `tf.enable_eager_execution()`, and in TensorFlow 2 it is the default. This is what's called imperative programming and I am glad they have it. It makes it easy to teach others about the beautiful things these tools are able to accomplish. I like to think that data science is about storytelling and discovery, and it should remain that way. Clear writing helps!

In [None]:
# Trend an analyte (butane) over time.
concentrate_types = [
    'hydrocarbon_concentrate',
    'concentrate_for_inhalation',
    'non-solvent_based_concentrate',
    'co2_concentrate',
    'food_grade_solvent_concentrate',
    'ethanol_concentrate',
]
concentrates = sample.loc[sample['intermediate_type'].isin(concentrate_types)]

# Aggregate data by day.
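# Average only the numeric columns; text columns cannot be averaged.
daily_concentrates = concentrates.groupby(concentrates.time.dt.date).mean(numeric_only=True)
# The index holds datetime.date objects, so compare against a date.
daily_concentrates = daily_concentrates.loc[daily_concentrates.index > datetime.date(2020, 12, 1)]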

# Look at the data!
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(9, 5)
ax.plot(daily_concentrates.index, daily_concentrates.solvent_butanes_ppm)
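
The daily aggregation can be checked on a handful of hypothetical rows. The timestamps, values, and cutoff date below are made up.

```python
import datetime

import pandas as pd

# Hypothetical test results with timestamps.
toy = pd.DataFrame({
    'time': pd.to_datetime([
        '2020-11-30 08:00', '2020-12-01 09:00',
        '2020-12-01 15:00', '2020-12-02 11:00',
    ]),
    'intermediate_type': ['co2_concentrate'] * 4,
    'solvent_butanes_ppm': [10.0, 100.0, 300.0, 50.0],
})

# Group by calendar day and average the numeric columns only.
daily = toy.groupby(toy['time'].dt.date).mean(numeric_only=True)

# The index holds datetime.date objects, so filter with a date.
cutoff = datetime.date(2020, 11, 30)
daily = daily.loc[daily.index > cutoff]
print(daily['solvent_butanes_ppm'])
```

Two tests on December 1 collapse into a single daily average, and the day on the cutoff itself is dropped because the comparison is strict.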


## Training the Model
When training the models you would specify what kind of optimization, hyperparameters, and data iterating methods you are using. To be honest, the training code is usually self-explanatory. If you did your job at the beginning, explaining your dataset and testing the model, this part of the tutorial is probably the one that needs the least explanation. In my experience, most data computing libraries use similar training strategies, so the training structure has become ubiquitous in some sense. If there is anything about your training that you still need the reader to know, you can always explain it beforehand.

In [None]:
# Fit a trend line.
X = daily_concentrates.index.map(datetime.date.toordinal)
Y = daily_concentrates['solvent_butanes_ppm'].fillna(0).values
X = sm.add_constant(X)
model = sm.OLS(Y, X)
results = model.fit()
print(results.summary())
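
Fitting a trend against dates works because each date can be converted to an integer day count with `toordinal`. The sketch below, on made-up values, fits a straight line with `numpy` so the slope can be read as change in ppm per day. Subtracting the first ordinal is an optional choice made here to keep the numbers small; it shifts the intercept but leaves the slope unchanged.

```python
import datetime

import numpy as np

# Five consecutive days with made-up butane readings rising 10 ppm per day.
start = datetime.date(2020, 12, 1)
dates = [start + datetime.timedelta(days=i) for i in range(5)]
ppm = np.array([100.0, 110.0, 120.0, 130.0, 140.0])

# Convert each date to its ordinal (days since 0001-01-01).
ordinals = np.array([d.toordinal() for d in dates], dtype=float)

# Measure time as days since the first observation.
days_elapsed = ordinals - ordinals[0]

# Fit a degree-1 polynomial: ppm = slope * days_elapsed + intercept.
slope, intercept = np.polyfit(days_elapsed, ppm, deg=1)
print(f'Trend: {slope:.2f} ppm per day, starting near {intercept:.2f} ppm')
```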

## Evaluating the Model
And lastly, it is good practice to evaluate your models on some held-out samples of the dataset. This helps the reader get the gist of what the tutorial you just showed them contains. It also helps to re-emphasize the value the tutorial provides for the reader. This part of the tutorial also helps you finalize your thoughts and share insights with your readers. Readers love insights. You can share plots, a lot of examples, and even explore the parameters of the model.

In [None]:
# Plot the trend line with the daily data points.
ax.plot(daily_concentrates.index, results.fittedvalues, c='r')
ax.set_ylabel('ppm')
ax.set_title('Average Butane levels in WA Concentrates', fontsize=18)
fig.autofmt_xdate()
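
As a minimal sketch of evaluating on held-out samples, the example below fits a line on part of a simulated dataset and measures the error on the rest. The data, the split rule (every fifth observation), and the error metric are assumptions chosen for illustration, not the approach used for the butane trend above.

```python
import numpy as np

# Simulated data: a known line plus random noise (fixed seed for repeatability).
rng = np.random.default_rng(seed=0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Hold out every fifth observation for testing.
test_mask = np.arange(x.size) % 5 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]

# Fit the line on the training observations only.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Predict the held-out observations and compute the root mean squared error.
predictions = slope * x_test + intercept
rmse = float(np.sqrt(np.mean((y_test - predictions) ** 2)))
print(f'Held-out RMSE: {rmse:.3f}')
```

For time series such as daily averages, holding out the most recent observations is often a more honest test than holding out scattered points.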

## Final Thoughts
You are not writing a book, so it is not necessary to have a conclusion section. In my experience, you use the final section to summarize all your findings and the future ideas you are working on. This is also a great time to congratulate the reader for making it to the end of the tutorial -- that's a huge achievement. It shows that you appreciate your readers. Then you can end the section with your favorite quote.

And that's it! Congratulations for reaching the end of this primer. You are now more than equipped to deliver excellent tutorials to the whole data science community and to a wider audience. With this short primer, you should reach thousands, and hopefully millions, but most importantly, with it, you should be able to bring value to your readers and keep expanding the human knowledge base. 

## References

We thank [The Cannabis Observer](https://cannabis.observer/) for diligently requesting public Washington State cannabis traceability data records.

### Data Sources

- [Cannabis Genome Data](https://www.kaggle.com/paultimothymooney/how-to-query-the-1000-cannabis-genomes-project)
- [WSLCB December 2020 Data](https://lcb.app.box.com/s/fnku9nr22dhx04f6o646xv6ad6fswfy9?page=1)

### Resources

- [Add footnote under the x-axis using matplotlib](https://stackoverflow.com/questions/7917107/add-footnote-under-the-x-axis-using-matplotlib)
- [Data Analysis with Pandas Blog Series](https://hackersandslackers.com/series/data-analysis-pandas/)
- [Data Visualization With Seaborn and Pandas](https://hackersandslackers.com/plotting-data-seaborn-pandas/)
- [How to build a regression model in python?](https://stackoverflow.com/questions/44325017/how-to-build-a-regression-model-in-python)
- [How to parse tsv file with python?](https://stackoverflow.com/questions/42358259/how-to-parse-tsv-file-with-python)
- [How to plot statsmodels linear regression (OLS) cleanly](https://stackoverflow.com/questions/42261976/how-to-plot-statsmodels-linear-regression-ols-cleanly)
- [Linear Models](https://scikit-learn.org/stable/modules/linear_model.html)
- [Python Pandas Error tokenizing data](https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data)
- [Rotate axis text in python matplotlib](https://stackoverflow.com/questions/10998621/rotate-axis-text-in-python-matplotlib)
- [Select rows from a DataFrame based on multiple values in a column in pandas](https://stackoverflow.com/questions/36410075/select-rows-from-a-dataframe-based-on-multiple-values-in-a-column-in-pandas)
- [Summing the number of occurrences per day pandas](https://stackoverflow.com/questions/17706109/summing-the-number-of-occurrences-per-day-pandas)
- [UnicodeDecodeError when reading CSV file in Pandas with Python](https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python)
- [WAC 314-55-102](https://apps.leg.wa.gov/wac/default.aspx?cite=314-55-102)
- [WSLCB How to Make a Public Records Request](https://lcb.wa.gov/records/make-public-records-request)

Written with ❤️ by [The Cannabis Data Science Team](https://cannlytics.com/team).