In [1]:
from fastai.vision.all import *
from fastai.text.all import *
from fastai.collab import *
from fastai.tabular.all import *

A fastai application follows this approach:

- Create appropriate DataLoaders
- Create a Learner
- Call a fit method
- Make predictions or view results.

#### **DataBlock**

In [2]:
data = DataBlock(
    blocks=(ImageBlock, CategoryBlock),             # Define data types for the input and output
    get_items=get_image_files,                      # Function to get image files
    splitter=GrandparentSplitter(train_name='train', 
                                 valid_name='test'),# Split data based on grandparent folder name
    get_y=parent_label,                             # Get labels from parent folder names
    item_tfms=Resize(224),                          # Resize images to 224x224
    batch_tfms=[
        *aug_transforms(),                          # Apply some data augmentation
        Normalize.from_stats(*imagenet_stats)       # Normalize image intensities (because pretraining)
    ]
)

This sets up a `DataBlock`, fastai's general-purpose tool for building datasets. It specifies the data types (images and categories), how to retrieve the images and labels, and how to split the dataset: `GrandparentSplitter` assigns each file to the training or validation set according to its grandparent folder name (`train` or `test` here), not by a random percentage. `item_tfms` resizes each image so that images of different sizes can be collated into a batch, and `batch_tfms` applies data augmentation and normalization to whole batches, which helps the model generalize.
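To make the folder-based conventions concrete, here is a minimal plain-Python sketch of what `parent_label` and `GrandparentSplitter` compute for a layout of the form `<split>/<label>/<file>`. The helper names `parent_label_sketch` and `grandparent_split_sketch` and the example paths are hypothetical; fastai's own implementations differ in detail.

```python
from pathlib import Path

def parent_label_sketch(p):
    """Label = name of the folder that directly contains the file."""
    return Path(p).parent.name

def grandparent_split_sketch(items, train_name="train", valid_name="test"):
    """Return (train_idxs, valid_idxs) based on each file's grandparent folder."""
    train_idxs, valid_idxs = [], []
    for i, p in enumerate(items):
        grandparent = Path(p).parent.parent.name
        if grandparent == train_name:
            train_idxs.append(i)
        elif grandparent == valid_name:
            valid_idxs.append(i)
    return train_idxs, valid_idxs

# Hypothetical file layout: <split>/<label>/<file>
items = [
    "cifar10/train/cat/0001.png",
    "cifar10/train/dog/0002.png",
    "cifar10/test/cat/0003.png",
]
train_idxs, valid_idxs = grandparent_split_sketch(items)
labels = [parent_label_sketch(p) for p in items]
print(train_idxs, valid_idxs, labels)
```

Files whose grandparent folder matches neither name are left out of both sets in this sketch.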

#### **DATA LOADER**

fastai's `DataLoader` adds functionality to PyTorch's `DataLoader` class:

 DataLoader (dataset=None, bs=None, num_workers=0, pin_memory=False,
             timeout=0, batch_size=None, shuffle=False, drop_last=False,
             indexed=None, n=None, device=None, persistent_workers=False,
             pin_memory_device='', wif=None, before_iter=None,
             after_item=None, before_batch=None, after_batch=None,
             after_iter=None, create_batches=None, create_item=None,
             create_batch=None, retain=None, get_idxs=None, sample=None,
             shuffle_fn=None, do_batch=None)

Arguments:

- dataset: dataset from which to load the data. Can be either map-style or iterable-style dataset.
- bs (int): how many samples per batch to load (if batch_size is provided then batch_size will override bs). If bs=None, then it is assumed that dataset.__getitem__ returns a batch.
- num_workers (int): how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process.
- pin_memory (bool): If True, the data loader will copy Tensors into CUDA pinned memory before returning them.
- timeout (float>0): the timeout value in seconds for collecting a batch from workers.
- batch_size (int): It is only provided for PyTorch compatibility. Use bs.
- shuffle (bool): If True, then data is shuffled every time dataloader is fully read/iterated.
- drop_last (bool): If True, then the last incomplete batch is dropped.
- indexed (bool): The DataLoader will make a guess as to whether the dataset can be indexed (or is iterable), but you can override it with this parameter. True by default.
- n (int): Defaults to len(dataset). If you are using iterable-style dataset, you can specify the size with n.
- device (torch.device): Defaults to default_device() which is CUDA by default. You can specify device as torch.device('cpu').
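The effect of `bs`, `shuffle` and `drop_last` can be illustrated without fastai. The following is a rough plain-Python sketch; the function `iter_batches` and its `seed` parameter are hypothetical and are not part of the fastai API.

```python
import random

def iter_batches(dataset, bs=4, shuffle=False, drop_last=False, seed=None):
    """Yield lists of up to `bs` items, mimicking bs / shuffle / drop_last."""
    idxs = list(range(len(dataset)))
    if shuffle:
        # Reorder the indices before cutting them into batches
        random.Random(seed).shuffle(idxs)
    for start in range(0, len(idxs), bs):
        batch_idxs = idxs[start:start + bs]
        if drop_last and len(batch_idxs) < bs:
            break  # skip the final, incomplete batch
        yield [dataset[i] for i in batch_idxs]

dataset = list(range(10))
all_batches = list(iter_batches(dataset, bs=4))
full_batches = list(iter_batches(dataset, bs=4, drop_last=True))
print(all_batches)   # three batches, the last one holds only two items
print(full_batches)  # the incomplete batch is dropped
```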

### Data Block/Loader EXAMPLES

#### 1 Using a DataBlock
- When you don't have a prebuilt dataset

In [4]:
from pathlib import Path

In [3]:
from fastai.vision.all import *


# Download and prepare the CIFAR-10 dataset
path = untar_data(URLs.CIFAR)

data = DataBlock(
    blocks=(ImageBlock, CategoryBlock),             # Define data types for the input and output
    get_items=get_image_files,                      # Function to get image files
    splitter=GrandparentSplitter(train_name='train', 
                                 valid_name='test'),# Split data based on grandparent folder name
    get_y=parent_label,                             # Get labels from parent folder names
    item_tfms=Resize(224),                          # Resize images to 224x224
    batch_tfms=[
        *aug_transforms(),                          # Apply some data augmentation (see Chapter 2)
        Normalize.from_stats(*imagenet_stats)       # Normalize image intensities (because pretraining)
    ]
)



In [6]:
dls = data.dataloaders(path, bs=64)
# dls.show_batch()

#### 2 Using .from_folder

```
(method) def from_folder(
    path: Any,
    train: str = 'train',
    valid: str = 'valid',
    valid_pct: Any | None = None,
    seed: Any | None = None,
    vocab: Any | None = None,
    item_tfms: Any | None = None,
    batch_tfms: Any | None = None,
    img_cls: type[PILImage] = PILImage,
    **kwargs: Any
) -> Any
```

In [12]:
path = untar_data(URLs.CIFAR)


# CIFAR ships with 'train' and 'test' folders, so 'test' serves as the validation set
dls = ImageDataLoaders.from_folder(path, train='train', valid='test', item_tfms=Resize(224))
# dls.show_batch()

#### Other types

**from_name_func**:
- Creates DataLoaders from a function that extracts the label from each filename. Useful when the labels are embedded in the file names themselves.

**from_path_func**:
- Similar to from_name_func, but the function receives the full path of each file, which gives more flexibility for extracting labels from the file path.

**from_df**:
- Creates DataLoaders from a pandas DataFrame. Particularly useful for datasets in tabular form where one column contains file paths or text data and another column contains the labels.

**from_csv**:
- Similar to from_df, but loads the data directly from a CSV file. You specify the path to the CSV file and which columns to use for the data and the labels.

**from_lists**:
- Lets you pass lists of data points and labels directly to create DataLoaders. Handy when you have already pre-processed your data in memory.

**from_dblock**:
- Creates DataLoaders from a DataBlock, which is a more flexible and customizable way to define how to gather data, split it, and form batches. A DataBlock lets you compose a pipeline for loading and preprocessing the data.
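The difference between a name-based and a path-based label function can be sketched in plain Python. The two functions and the example file names below are hypothetical illustrations of the kind of callable you might pass to `from_name_func` or `from_path_func`; they are not part of fastai.

```python
from pathlib import Path

def label_from_name(fname):
    """Name-based labelling: the label is the text before the last underscore."""
    return Path(fname).stem.rsplit("_", 1)[0]

def label_from_path(path):
    """Path-based labelling: the label is the name of the containing folder."""
    return Path(path).parent.name

# Hypothetical file names, chosen only to illustrate the two conventions
name_label = label_from_name("great_pyrenees_173.jpg")
path_label = label_from_path("data/images/cat/173.jpg")
print(name_label, path_label)
```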



#### 3. Tabular data loader

```
 TabularDataLoaders (*loaders, path:str|pathlib.Path='.', device=None)
```
Most used one:
```
TabularDataLoaders.from_df (df:pd.DataFrame, path:str|Path='.',
                             procs:list=None, cat_names:list=None,
                             cont_names:list=None, y_names:list=None,
                             y_block:TransformBlock=None,
                             valid_idx:list=None, bs:int=64,
                             shuffle_train:bool=None, shuffle:bool=True,
                             val_shuffle:bool=False, n:int=None,
                             device:torch.device=None,
                             drop_last:bool=None, val_bs:int=None)
```

In [13]:
path = untar_data(URLs.ADULT_SAMPLE)

df = pd.read_csv(path/'adult.csv')


In [16]:
df.head()

| Unnamed: 0 | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | | Divorced | | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |


In [30]:
# Define preprocessors
procs = [Categorify, FillMissing, Normalize]
dep_var = 'salary'
cont_vars, cat_vars = cont_cat_split(df, max_card=4000, dep_var=dep_var)

splits = RandomSplitter()(range_of(df))
to = TabularPandas(df, procs, cat_vars, cont_vars, y_names=dep_var, splits=splits, y_block=CategoryBlock())



The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  to[n].fillna(self.na_dict[n], inplace=True)


In [32]:
# print(to.xs.iloc[:5]) 
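
As a rough intuition for what the three preprocessors in `procs` do, here is a plain-Python sketch on two toy columns. The functions `categorify`, `fill_missing` and `normalize` are hypothetical simplifications written for this note, and the choices shown (code 0 for missing categories, median fill, population standard deviation) are assumptions of the sketch, not a statement of fastai's exact behavior.

```python
from statistics import mean, median, pstdev

def categorify(values):
    """Map each category to an integer code; 0 is reserved for missing values."""
    vocab = sorted({v for v in values if v is not None})
    codes = {v: i + 1 for i, v in enumerate(vocab)}
    return [codes.get(v, 0) for v in values]

def fill_missing(values):
    """Replace None with the median of the observed values and flag what was filled."""
    fill = median(v for v in values if v is not None)
    was_missing = [v is None for v in values]
    return [fill if v is None else v for v in values], was_missing

def normalize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Toy columns (hypothetical values in the style of the adult dataset)
occupation = ["Exec-managerial", None, "Prof-specialty", "Other-service"]
education_num = [12.0, 14.0, None, 15.0]

occ_codes = categorify(occupation)
edu_filled, edu_na = fill_missing(education_num)
edu_norm = normalize(edu_filled)
print(occ_codes, edu_filled, edu_na)
```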

### **Data transformation**

#### Functions for getting, splitting, and labeling data, as well as generic transforms


In [21]:
## GET DATA

path = untar_data(URLs.MNIST_TINY)
(path/'train').ls()


(#2) [Path('C:/Users/prebe/.fastai/data/mnist_tiny/train/3'),Path('C:/Users/prebe/.fastai/data/mnist_tiny/train/7')]

```get_files(path)```
- Get all files from path

```get_image_files(path)```
- This is simply get_files called with a list of standard image extensions.

-------------------------------------------

``` RandomSplitter (valid_pct=0.2, seed=None)```
- Create function that splits items between train/val with valid_pct randomly.

```TrainTestSplitter()```
- Split items into random train and test subsets using sklearn train_test_split utility.

```GrandparentSplitter(train_name="train", valid_name="valid")```
- Split items according to their grandparent folder names (train_name and valid_name).
- Often used with prebuilt datasets
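
A splitter is a function that takes the list of items and returns two lists of indices, one for training and one for validation. The sketch below imitates the idea behind `RandomSplitter` in plain Python; `random_splitter_sketch` is a hypothetical name, and it uses the standard library's `random` module, which is an assumption of this sketch and not how fastai generates its permutation.

```python
import random

def random_splitter_sketch(valid_pct=0.2, seed=None):
    """Return a function that splits item indices into (train, valid) at random."""
    def _split(items):
        idxs = list(range(len(items)))
        random.Random(seed).shuffle(idxs)
        cut = int(valid_pct * len(items))  # number of validation items
        return idxs[cut:], idxs[:cut]
    return _split

items = [f"img_{i}.png" for i in range(10)]
train_idxs, valid_idxs = random_splitter_sketch(valid_pct=0.2, seed=42)(items)
print(len(train_idxs), len(valid_idxs))
```

Fixing `seed` makes the split reproducible, which keeps the validation set identical across runs.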



_______________________________________________

### LEARNER
##### Groups together a model, data loaders and a loss function to handle training

___________________

**Model**: 
This is the actual neural network that you want to train. It can be any PyTorch model.

**DataLoaders**: 
This is the FastAI object that contains your training and validation (and optionally test) data loaders. Each DataLoader provides batches of data to the model during training and validation.

**Optimizer**: 
This is the algorithm used to update the weights of the model based on the gradients of the loss function. Common optimizers include SGD, Adam, etc.

**Loss Function**: 
This is a function that measures the difference between the actual values and predictions, guiding the training process by indicating how well the model is performing.

**Metrics**: 
These are additional functions used to evaluate the performance of the model. Unlike the loss function, metrics are used for human interpretation and are not used for training.
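
How these five pieces interact can be shown with a toy training loop in plain Python. Everything below is a hypothetical illustration: a one-feature linear model, hand-written batches standing in for the DataLoaders, a hand-derived SGD step as the optimizer, mean squared error as the loss, and mean absolute error as the metric. It is a sketch of the roles, not of fastai's `Learner`.

```python
def model(x, w, b):
    """Model: a one-feature linear function."""
    return w * x + b

def mse_loss(preds, targets):
    """Loss function: drives the weight updates."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def mae_metric(preds, targets):
    """Metric: reported for humans, never used to compute gradients."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def sgd_step(w, b, xs, ys, lr):
    """Optimizer: one gradient-descent update using the gradient of the MSE loss."""
    n = len(xs)
    preds = [model(x, w, b) for x in xs]
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
    return w - lr * grad_w, b - lr * grad_b

# "DataLoaders": training batches and one validation batch for y = 2x + 1
train_batches = [([0.0, 1.0], [1.0, 3.0]), ([2.0, 3.0], [5.0, 7.0])]
valid_xs, valid_ys = [4.0, 5.0], [9.0, 11.0]

w, b = 0.0, 0.0
for epoch in range(200):
    for xs, ys in train_batches:
        w, b = sgd_step(w, b, xs, ys, lr=0.05)

valid_preds = [model(x, w, b) for x in valid_xs]
valid_loss = mse_loss(valid_preds, valid_ys)
valid_mae = mae_metric(valid_preds, valid_ys)
print(w, b, valid_loss, valid_mae)
```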

In [48]:
from fastai.vision.all import *

# Load a sample dataset
path = untar_data(URLs.PETS)
files = get_image_files(path/"images")[:100]

def label_func(f): return f[0].isupper()

# Prepare DataLoaders
dls = ImageDataLoaders.from_name_func(path, files, label_func, item_tfms=Resize(224))

# Create a Learner with a pretrained ResNet-34 backbone
learn = vision_learner(dls, resnet34, metrics=accuracy)

# Train the model
# learn.fit_one_cycle(1)

#### Other learners


1. TextLearner
- language_model_learner
- text_classifier_learner


2. TabularLearner

___________________________
##### Text Learner


In [72]:

from fastai.text.all import *



path = untar_data(URLs.IMDB_SAMPLE)

df = pd.read_csv(path/'texts.csv')

dls = TextDataLoaders.from_df(df, path=path, text_col='text', label_col='label', valid_col='is_valid')

learn = text_classifier_learner(dls, AWD_LSTM)


Due to IPython and Windows limitation, python multiprocessing isn't available now.
So `n_workers` has to be changed to 0 to avoid getting stuck


##### Tabular Learner

In [73]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names, cont_names=cont_names, 
                                 y_names="salary", valid_idx=list(range(800,1000)), bs=64)
learn = tabular_learner(dls)


### Learner.fit()

 ```Learner.fit (n_epoch, lr=None, wd=None, cbs=None, reset_opt=False, start_epoch=0)```

 - n_epoch: number of epochs (number of training cycles)
 - lr: learning rate
 - wd: weight decay
 - cbs: callbacks
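
To show where these arguments enter training, here is a toy sketch on a single weight. The function `fit_sketch` is hypothetical; it applies weight decay as a multiplicative shrink before the gradient step, which is one common formulation and an assumption of this sketch, and it calls each callback once per epoch.

```python
def fit_sketch(w, grad_fn, n_epoch, lr=0.1, wd=0.0, cbs=()):
    """Run `n_epoch` update steps on a single weight `w`."""
    for epoch in range(n_epoch):
        grad = grad_fn(w)
        w = w * (1 - lr * wd)  # weight decay shrinks the weight toward zero
        w = w - lr * grad      # gradient step scaled by the learning rate
        for cb in cbs:         # callbacks observe training after each epoch
            cb(epoch, w)
    return w

def grad_fn(w):
    """Gradient of the toy loss (w - 3)^2."""
    return 2 * (w - 3)

history = []
w_final = fit_sketch(0.0, grad_fn, n_epoch=50, lr=0.1,
                     cbs=[lambda epoch, w: history.append(w)])
w_decayed = fit_sketch(0.0, grad_fn, n_epoch=50, lr=0.1, wd=0.5)
print(w_final, w_decayed, len(history))
```

Without weight decay the weight converges to the minimum of the loss; with `wd=0.5` it settles at a smaller value, which is the regularizing effect weight decay is meant to have.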

____________________________________

#### Finding a good learning rate

In [50]:
learning_rate = learn.lr_find(suggest_funcs=[minimum, steep, valley, slide])

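TypeError: Learner.lr_find() got an unexpected keyword argument 'num_iterations'

`lr_find` has no `num_iterations` argument; the number of iterations is set with `num_it`. The call above passes only `suggest_funcs`, so it does not trigger this error.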