# Data structures are ways to organize data

This notebook covers the basics of data structures and how they apply to future modules.

# Data structures can be built-in (e.g. lists, numbers, strings):

In [None]:
car_info = ["Black", "Mazda", "Miata"]  # this is a List!
car_make = "Mazda"  # this is a String!
car_model = "Miata"  # this is also a String!
car_color = "Black"  # yet another String!
car_year = 2014  # this is an Integer
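
To check which built-in type a value has, Python's `type()` function can be used. The following is a minimal sketch that reuses some of the values above. The names `info_type`, `make_type`, `year_type`, and `first_item` are introduced here only for illustration.

```python
# Values mirroring the cell above
car_info = ["Black", "Mazda", "Miata"]
car_make = "Mazda"
car_year = 2014

# type() returns the class of each object
info_type = type(car_info)
make_type = type(car_make)
year_type = type(car_year)
print(info_type, make_type, year_type)
# prints: <class 'list'> <class 'str'> <class 'int'>

# Lists are ordered, so items can be retrieved by position (starting at 0)
first_item = car_info[0]
print(first_item)  # prints: Black
```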

# Data structures can be user-defined (to store more complex data):
- This is done through Class definitions. A simple example:

In [None]:
class Car:
    num_wheels = 4
    
    def __init__(self, make, model, color):
        self.make = make
        self.model = model
        self.color = color


<img src="https://intellipaat.com/mediaFiles/2019/03/python10.png">
image source: https://intellipaat.com/

In [None]:
my_car = Car(make=car_make, model=car_model, color=car_color)

In [None]:
print(f"I just bought a {my_car.color} {my_car.make} {my_car.model}!")

#### In the above example, we have created a data structure that organizes the information (attributes) relating to the Car class.

- ```__init__()``` is a special method that lets us define instance attributes by assigning values to names on ```self``` inside it. Instance attributes are set when an instance is created. Not as important for this course, but worth pointing out.
- Note the num_wheels variable above the ```__init__()``` method. This is an example of a class attribute: an attribute that has the same value for all class instances. By this definition, every Car should have 4 wheels. Also not as important for this course.
- Attributes can be accessed using "dot notation" (important!)

In [None]:
my_car.make

In [None]:
my_car.num_wheels
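
The difference between class attributes and instance attributes can be sketched as follows. This is a simplified, self-contained version of the `Car` class (only `make` is kept), and the names `car_a`, `car_b`, `shared`, and `after_shadow` are introduced here for illustration.

```python
class Car:
    num_wheels = 4  # class attribute: shared by every instance

    def __init__(self, make):
        self.make = make  # instance attribute: set separately for each instance


car_a = Car("Mazda")
car_b = Car("Honda")

# Both instances read the same class attribute
shared = (car_a.num_wheels, car_b.num_wheels)

# Assigning through an instance creates an instance attribute that
# shadows the class attribute for that one instance only
car_a.num_wheels = 3
after_shadow = (car_a.num_wheels, car_b.num_wheels, Car.num_wheels)

print(shared, after_shadow)  # prints: (4, 4) (3, 4, 4)
```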

#### Instance methods are functions inside the class that can be called from each class instance:

In [None]:
class Car:
    num_wheels = 4
    
    def __init__(self, make, model, color):
        self.make = make
        self.model = model
        self.color = color
        
    def description(self):
        return f"This is a {self.color} {self.make} {self.model}!"
    
    def set_color(self, new_color):
        self.color = new_color
    
my_car = Car(make=car_make, model=car_model, color=car_color)
my_car.description()

#### Can we use instance methods to change instance attributes?

In [None]:
my_car.set_color("Red")
my_car.description()

In [None]:
import pandas as pd

car_info = ["Mazda", 2014, "Mazda3"]
car_make = "Mazda"
car_model = "Mazda3"
car_year = 2014
car_name = "Zoom zoom"
car_smog_history = pd.DataFrame([['1/1/2014','PASS'],['1/1/2015','PASS'],['1/1/2016','FAIL']], columns=['date','passfail'])

class Car:
    num_wheels = 4
    
    def __init__(self, name, model, year, smog_history):
        self.name = name
        self.model = model
        self.year = year
        self.smog_history = smog_history
        
    def description(self):
        return f"This car's name is \"{self.name}\" and it is a {self.year} {self.model}. It has {self.num_wheels} wheels."
    
    def set_name(self, new_name):
        self.name = new_name

my_car = Car(name=car_name, model=car_model, year=car_year, smog_history=car_smog_history)
my_car.smog_history

# Now a real example

In [None]:
import numpy as np
import pandas as pd
import scanpy as sc

## Scanpy (and other packages) use the AnnData Class to organize single cell data. 
- This data may span multiple dimensions. In other words, for a given gene and cell, AnnData objects can store counts, normalized counts, cell assignments, etc.
- The example below ([sc.read_10x_mtx()](https://scanpy.readthedocs.io/en/stable/generated/scanpy.read_10x_mtx.html)) returns an instance of the [AnnData](https://anndata.readthedocs.io/en/stable/generated/anndata.AnnData.html#anndata.AnnData) class (a.k.a. an AnnData object).
- Scanpy provides a nice visualization of AnnData's attributes:

<img src="https://falexwolf.de/img/scanpy/anndata.svg">

#### Here's the example. Let's use ```sc.read_10x_mtx()```, which takes in 10x Genomics Cell Ranger output (.mtx) and creates an object (an instance of the AnnData class) called ```adata```:

In [None]:
adata = sc.read_10x_mtx(
    'public-data/1_programming/data/filtered_gene_bc_matrices/hg19/',  # the directory with the `.mtx` file
    var_names='gene_symbols',                # use gene symbols for the variable names (variables-axis index)
    cache=True)                              # write a cache file for faster subsequent reading

#### Similar to my_car.make and my_car.model, we can access adata attributes:

In [None]:
# holds variable (gene) names
adata.var

In [None]:
# holds observation (cell) names
adata.obs

In [None]:
# holds read counts (in a not-super-readable "sparse" matrix format). But you can see it anyway: 
adata.X

In [None]:
# Simply printing adata on its own returns a nice summary of the instance:
print(adata)

## Throughout the course of this module, you may encounter functions that modify adata objects directly, or "in place":
- Note that the number of stored elements shrinks after filtering out cells with too few detected genes ([sc.pp.filter_cells()](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.filter_cells.html)):

In [None]:
adata.X

In [None]:
sc.pp.filter_cells(adata, min_genes=1000)

In [None]:
adata.X
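
The same distinction exists for plain Python lists, which may make the idea easier to see. The sketch below uses a hypothetical list of per-cell gene counts (`counts`) rather than an AnnData object. Building a new list leaves the original untouched, while slice assignment overwrites the contents of the existing list.

```python
# Hypothetical genes-per-cell values, for illustration only
counts = [5, 1200, 800, 3000]
original_len = len(counts)

# Not in place: a new list is returned and `counts` is unchanged
filtered_copy = [c for c in counts if c >= 1000]
len_after_copy = len(counts)

# In place: slice assignment replaces the contents of the same list object,
# so every name that refers to this list sees the change
same_object = counts
counts[:] = [c for c in counts if c >= 1000]
len_after_inplace = len(same_object)

print(original_len, len_after_copy, len_after_inplace)  # prints: 4 4 2
```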

# If you are unsure which function modifies objects in place (or what the function returns), use the "help" function to retrieve associated documentation

In [None]:
help(sc.pp.filter_cells)  # Alternatively, Googling the function returns its help page: https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.filter_cells.html

### In its documentation, we can see that the function requires a ```data``` parameter (of type ```anndata._core.anndata.AnnData```).

The function also optionally accepts various filter params:
- min_counts
- min_genes
- max_counts
- max_genes

It also accepts an 'inplace' flag, which tells the function whether or not to modify the AnnData object in place. As we saw above, 'inplace' defaults to <b>True</b>:

filter_cells(data: anndata._core.anndata.AnnData, min_counts: Optional[int] = None, min_genes: Optional[int] = None, max_counts: Optional[int] = None, max_genes: Optional[int] = None, <b>inplace: bool = True</b>, copy: bool = False) -> Optional[Tuple[numpy.ndarray, numpy.ndarray]]
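
The general pattern of an 'inplace' flag can be sketched in plain Python. The function `filter_values` below is hypothetical and is not how Scanpy implements `filter_cells`. In particular, it returns a filtered copy when `inplace=False`, whereas the signature above shows `filter_cells` returning an optional tuple of arrays. The sketch also shows one way to read a default value programmatically with the standard library's `inspect.signature`.

```python
import inspect


def filter_values(data, min_value, inplace=True):
    """Hypothetical helper: keep only values >= min_value."""
    kept = [v for v in data if v >= min_value]
    if inplace:
        data[:] = kept  # overwrite the caller's list and return nothing
        return None
    return kept  # leave the caller's list untouched


values = [5, 1200, 800, 3000]

result_copy = filter_values(values, 1000, inplace=False)
len_after_copy = len(values)  # still 4: the original was not modified

result_inplace = filter_values(values, 1000)  # inplace defaults to True
len_after_inplace = len(values)  # now 2: the original was modified

# Read the default of the `inplace` parameter without opening the docs
default_inplace = inspect.signature(filter_values).parameters["inplace"].default
print(default_inplace)  # prints: True
```

A function that works in place often returns `None`, so assigning its result back to the original name (for example `values = filter_values(values, 1000)`) would discard the data.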


# To avoid potentially overwriting data within an object, I recommend running the cells of each notebook in order!
- This will help to avoid errors, as well as help you keep track of each step in your notebooks.
- Try running the cells below out of order to see how these plots change.

In [None]:
sc.pl.highest_expr_genes(adata, n_top=20, )  # Run first

In [None]:
sc.pl.highest_expr_genes(adata, n_top=20, )  # Run third

In [None]:
sc.pp.filter_cells(adata, min_genes=2000)  # Run second
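
Why execution order matters can be sketched without Scanpy. In the hypothetical `run` function below, a "plot" step only reports how many cells it would see, and a "filter" step modifies the data in place, loosely mirroring the cells above. The data values and step names are assumptions made for illustration.

```python
def run(order, min_genes=2000):
    """Simulate running notebook cells in the given order."""
    genes_per_cell = [500, 1500, 2500, 3500]  # hypothetical data, reset on each run
    seen_by_plots = []
    for step in order:
        if step == "plot":
            # A plotting cell only reads the current state of the data
            seen_by_plots.append(len(genes_per_cell))
        elif step == "filter":
            # A filtering cell changes the data in place
            genes_per_cell[:] = [g for g in genes_per_cell if g >= min_genes]
    return seen_by_plots


order_a = run(["plot", "filter", "plot"])  # plot, then filter, then plot again
order_b = run(["filter", "plot", "plot"])  # filter before any plotting

print(order_a)  # prints: [4, 2]
print(order_b)  # prints: [2, 2]
```

The same plotting step gives different results depending on whether the filter has already run, which is why the same cell can produce different plots across sessions.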