<a href="https://colab.research.google.com/github/dlab-berkeley/Computational-Social-Science-Training-Program/blob/master/Deep_Learning_and_Tensorflow_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Computational Social Science]
## 5-5 Deep Learning - Solutions


This notebook will introduce you to the fundamentals of Keras and explore techniques for deep learning with text data. Key concepts covered in this notebook include:

1. Google Colab and GPUs
2. Use keras to adapt and tune neural nets
3. Existing resources to help analyze language data


With these basic building blocks, you will be equipped to explore and implement deep learning algorithms for your own project.

# Google Colab

---



Objectives:

- Set up a Google Colab notebook
- Create, delete, run, and edit cells
- Cover variable, notebook and package management

15 minutes

## Introducing Google Colab


Google Colab is a platform for cloud-based computation and coding. It is similar to a Jupyter Notebook, where individual cells of code are executed sequentially. It doesn't require local installation on your computer and can be shared and edited by multiple people at the same time. However the Colab notebook requires you to be connected to the internet, while jupyter notebooks can be run on your local machine. Google Colab notebooks are in the .ipynb format, and can be saved and opened either directly or via Google Drive.



##Basic Operations

Google Colab has several features that help organize code and long notebooks. A few key concept to know to use this notebook effectively are:

- Use the Insert tab in the upper bar, or press the +Code/+Text buttons in the top left of the window.

- Text cells can be edited and formatting with the buttons at the top of the cell.

- The buttons at the top right of the cell give you options to move, modify and delete the cell.

- You can run code with shift+enter, or by clicking the top left of the box.

- For more commands, use ctl+shift+p and select the desired command from the command palette

An example code cell is below. Try executing and editing the cell.

In [None]:
print("Welcome to Google Colab")
x=12+78

The buttons on the left panel help manage the notebook (search, table of contents, files). This is important for organizing your code and navigating long notebooks.



## Package Management

Like Anaconda, Google Colab comes with many packages already available, and you can also install local packages using pip. Use the following lines of code in order to see which packages you have and which ones you need to install.



```
#check which packages you have available (listed alphabetically). The version numbers are also avaliable which can be useful in determining issues with coding between computers.
!pip list

#install a new package
!pip install numpy
```




<List of packages that we will use in this tutorial>

In the following cell are the packages that you will need to complete this notebook. Run this line of code to make sure that everything is installed properly.

In [46]:
#import packages for deep learning
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
# from keras.utils import np_utils
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import RMSprop

## Customization

Finally, there are several settings that you can customize if you so choose. These can be found under Tools -> Settings, where you can change the font size, background, and other aesthetic settings of the notebook to suit you.

In addition, in Tools -> Keyboard shortcuts you can view and adapt shortcuts to your preferences as well.




## Challenge

Try out the following exercises to get comfortable with the new interface:

1) Open the editor settings (Tools-> Settings->Editor) and select "Show line numbers". Now your cells will have line numbers next to them, which we can refer to when discussing code during this workshop.

2) Make a new code cell below and save the product of 60 and 72 to a new variable. Then check the value of the variable in the variable tab to the left.


In [None]:
#solutions
#1) follow the directions in the question. You should see the line numbers in
# each cell on the left side
#2)
new = 60*72 #check the value of this variable using the variable panel on the left

# Introduction to GPU

Objectives:
- Understand the benefits of GPUs
- Set up GPU for Google Colab
- Compare performance on tasks vs CPU


10 minutes

As you've found in your previous models, some models take a significant amount of time to run. Models may also exceed the capacity of the local computer's processing power. This will either result in code that never finished running, or an error message indicating that the code has timed out without completing.

To counteract this issue, TPU/GPU are parallel processing units that greatly speed up models. This can make some models that are otherwse impossible to train possible (Think minutes rather than hours)

TPU is made specifically for tensorflow architecture, and speeds it up even more than GPUs.

## GPU Access
Oftentimes you need to pay for cloud services and access to GPUs, but one advantage of Colab is that it has free access to a certain amount of GPU/TPU units. This access is somewhat limited, but should be more than enough for what we are using it for today. We will discuss limitations and further options for long-term use in a later section of the workshop.


Additional resource: https://colab.research.google.com/notebooks/gpu.ipynb#scrollTo=sXnDmXR7RDr2

The notebook will automatically choose which device (read: GPU vs CPU) to run the code on, but if you want to make sure that something is being run on a certain device, you can select a specific device as in the snippet below.


```
# This is formatted as code
with tf.device(device_name):
  #put task here
  #return output
```

For now, we will trust the notebook's/ Tensorflow's allocation of computing power.



## Challenge

1) Run the following lines of code to test how fast your computer can do a task. Report the results


In [None]:
import timeit

print(timeit.timeit('[x**2 for x in range(10)]'))

3.5632765139999947


2)  Change the settings to use GPU:  Edit --> Notebook Settings --> Hardware Accelerator --> GPU
. Run the code below to make sure GPU is enabled.

In [None]:
#run this code to check that you have the GPU enabled
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


3) Re-run the same timing task and report your results. How much of a difference is there in timing?

In [None]:
import timeit
print(timeit.timeit('[x**2 for x in range(10)]'))

2.6405670770000143


As we run more complex tasks, the efficiency of GPUs becomes more and more of a difference. If you are curious, you can compare the timing of the tasks in this notebook with GPU/TPU/CPU and note the difference. Even though in this notebook we are working with fairly small dataset and task, these differences will be important at larger scale.

# Deep Learning









Objectives:
- Code and optimize a neural network
- Adapt a network to new data

20 minutes

In previous sections of this course, you have covered neural networks and deep learning for classifying the MNIST dataset. The task was classifying handwritten digits 0-9 based on images. In this section, we will revisit deep learning in Python with text data.

We will start with the classificaton problem (student loan vs checking/savings account) from the NLP section of the course, where customer complaint data was used classify what type of account the complaint was related to. We will use the same embeddings we trained for the final logistic regression problem in that section of the course.

First, we will load in the data and split it into training and validation data.

In [47]:
word2vec_features_df=pd.read_csv('https://github.com/dlab-berkeley/Computational-Social-Science-Training-Program/raw/main/data/embeddings.csv').iloc[:, 1:]
y=pd.read_csv('https://github.com/dlab-berkeley/Computational-Social-Science-Training-Program/raw/main/data/y.csv')
y_vals=y['Product_binary'].values
X_train, X_test, y_train, y_test = train_test_split(word2vec_features_df,
                                                    y_vals,
                                                    train_size = .80,
                                                    test_size=0.20,
                                                    random_state = 10)


We will use Keras now. Keras is a widely-used high-level neural network API known for its simplicity and quick prototyping capabilities. While it was originally an independent project that could operate with various backends such as Theano and CNTK, TensorFlow has emerged as its primary backend. This integration allows Keras to take advantage of TensorFlow's powerful features, including its ability to run smoothly on different hardware like CPUs, GPUs, and TPUs. By utilizing TensorFlow as its underlying engine, Keras simplifies many complex aspects of deep learning, making the development of neural networks more accessible. Its user-friendly nature has led to extensive use in both industrial applications and academic research.

Next we define the model. In Keras, each layer of the model has to be individually specified. This allows significant control over the model, including different parameters for each level.

This model has a dense layer with 128 neurons in each, and a dropout layer where 20% of the connections are dropped out for each layer. The final output layer uses a sigmoid activation function to create a final binary output (0 or 1).

In [48]:
def NN_model():
    # create model
    model = Sequential()

    # A fully connected layer with 128 neurons
    model.add(Dense(128, input_dim=300,activation='relu'))

    # A dropout layer that randomly excludes 20% of neurons in the layer
    model.add(Dropout(0.2))

    # An output layer with binary classification
    model.add(Dense(1, activation='sigmoid'))

    # Compile model with crossentropy
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

Finally, we fit and evaluate the model.

In [49]:
model = NN_model()
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, verbose=0)

# Evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)

print("NN Error: %.2f%%" % (100-scores[1]*100))
print(model.summary())

NN Error: 21.50%
Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_19 (Dense)            (None, 128)               38528     
                                                                 
 dropout_12 (Dropout)        (None, 128)               0         
                                                                 
 dense_20 (Dense)            (None, 1)                 129       
                                                                 
Total params: 38657 (151.00 KB)
Trainable params: 38657 (151.00 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None


This is a simple neural network with a couple of densely connected layers and a couple of dropout layers. When working with neural nets, it's often a good idea to start with a simple net to make sure the basics of the code work, then gradually create more complicated architectures once the code runs smoothly.

Now, let's use our tensor knowledge to adapt this architecture to another set of data. First, let's load in the MNIST digits dataset (in practice, we would likely be using a dataset more similar to the one in the original model). The MNIST dataset is three dimensions (n_samplesx28x28), so we need to flatten the data for now to create a two dimensional tensor  n_samplesx784 to fit with the neural net we are working on. Note: instead of two classes, the MNIST dataset uses 10 classes (one for each digit 0-9).

In [50]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# reshape to [samples][width][height][pixels]
X_train = X_train.reshape(X_train.shape[0], 28*28)
X_test = X_test.reshape(X_test.shape[0], 28*28)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

Here is the same code from the NN model above. What do you need to change in order to run the same model on the new data? Note which parameters and values you need to change. How does this relate to the differences in the data? Let's edit the code to work with the new data shape and execute it.

Hint: use tf.shape() to see the compare the shapes of the MNIST and original dataset

These are the lines of code we need to change to make this model work with new data:


In line 6:
```
model.add(Dense(128, input_dim=784,activation='relu')) #change input dim
```
The embeddings dataset had 300 features, or columns, the new MNIST dataset has 784, so we need to make sure to match the numbers in model architecture.

In line 12:

```
    model.add(Dense(10, activation='softmax')) #change dimension to number of categories
```
The final layer needs to have 10 categories, rather than two, since there are more classes in the MNIST dataset. In addition, the activation function needs to be changed to softmax.

In lne 15:

```
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

```

Again, because of the number of classes, the loss function used must be categorical cross entropy rather than binary cross entropy.

Here is the updated model:





In [51]:
def diff_CNN_model():
    # create model
    model = Sequential()

    model.add(Dense(128, input_dim=784,activation='relu')) #change input dim

    model.add(Dropout(0.2))

    model.add(Dense(128, activation='relu'))
    model.add(Dropout(.2))

    model.add(Dense(10, activation='softmax')) #change dimension to number of categories and activation function

    #change to categorical crossentropy
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model


model = diff_CNN_model()
# Fit the model
print(X_train.shape)
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=200, verbose=2)

# Evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("NN Error: %.2f%%" % (100-scores[1]*100))

(60000, 784)
Epoch 1/10
300/300 - 2s - loss: 5.2501 - accuracy: 0.6533 - val_loss: 0.7492 - val_accuracy: 0.8283 - 2s/epoch - 7ms/step
Epoch 2/10
300/300 - 1s - loss: 0.9208 - accuracy: 0.7827 - val_loss: 0.5430 - val_accuracy: 0.8724 - 891ms/epoch - 3ms/step
Epoch 3/10
300/300 - 1s - loss: 0.6803 - accuracy: 0.8351 - val_loss: 0.4452 - val_accuracy: 0.8939 - 895ms/epoch - 3ms/step
Epoch 4/10
300/300 - 1s - loss: 0.5580 - accuracy: 0.8616 - val_loss: 0.3704 - val_accuracy: 0.9176 - 880ms/epoch - 3ms/step
Epoch 5/10
300/300 - 1s - loss: 0.4781 - accuracy: 0.8818 - val_loss: 0.3551 - val_accuracy: 0.9178 - 1s/epoch - 4ms/step
Epoch 6/10
300/300 - 1s - loss: 0.4212 - accuracy: 0.8935 - val_loss: 0.3117 - val_accuracy: 0.9323 - 1s/epoch - 4ms/step
Epoch 7/10
300/300 - 1s - loss: 0.3684 - accuracy: 0.9053 - val_loss: 0.2657 - val_accuracy: 0.9401 - 890ms/epoch - 3ms/step
Epoch 8/10
300/300 - 1s - loss: 0.3358 - accuracy: 0.9144 - val_loss: 0.2481 - val_accuracy: 0.9430 - 833ms/epoch - 3ms/s

# Optimizing Neural Nets


Objectives:
- Explore strategies to optimize a neural net
- Implement an optimizer with custom settings
- Grid search parameters

20 minutes

Optimizing neural nets is a key point of using these powerful models effectively, as with any ML models. However, neural nets have many parameters that can be tuned and are a challenge for traditional optmization methods such as grid search.

In the previous challenge, we experimented with improving the accuracy of the model. The following strategies can help guide the optmization process for fine-tuning algorithms.

1. Feature engineering (refer to Natural Language Processing Notebook)

2. Try a smaller network (minimize redundancy) or a larger network (capture more complex relationships)

3. Change learning rate
4. Use appropriate architecture for the data/task

5. Test parameters

6. Decrease batch size

Depending on the task, data, and neural network used, there may be a significant amount of tuning necessary in order to achieve an optimal result. This is one reason why leveraging existing models that are already optimized can give a huge advantage for language tasks.

Further reference this article: https://towardsdatascience.com/optimizing-neural-networks-where-to-start-5a2ed38c8345


For this notebook we will start with changing the learning rate.

In previous examples, we passed the optimizer to the compile funciton
```
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

```
Which uses the default parameters for the function. Now that we are customizing the parameters, we want to use the actual optimizer function, and then pass that optimizer into the .compile() function.

```
model.compile(....,opt=keras.optimizers.Adam())
```

Here is the documentation for that function: https://keras.io/api/optimizers/adam/

What is the default parameter for learning rate? What are some of the other parameters for the Adam optimizer?

##Challenge

Test the following learning rates: [.0001,.001,.01,.1]. Which one performs the best? Which one performs the worst?



In [52]:
#load in data to use for this test
from tensorflow.keras.optimizers import Adam

#cnn classification for neural nets
word2vec_features_df=pd.read_csv('https://github.com/dlab-berkeley/Computational-Social-Science-Training-Program/raw/main/data/embeddings.csv').iloc[:, 1:]

y=pd.read_csv('https://github.com/dlab-berkeley/Computational-Social-Science-Training-Program/raw/main/data/y.csv')
y_vals=y['Product_binary'].values
X_train, X_test, y_train, y_test = train_test_split(word2vec_features_df,
                                                    y_vals,
                                                    train_size = .80,
                                                    test_size=0.20,
                                                    random_state = 10)
#print(word2vec_features_df.shape)
#print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)

In [53]:
#solution
##1
def NN_model():
    # create model
    model = Sequential()

    model.add(Dense(128, input_dim=300,activation='relu')) #change input dim

    # A dropout layer that randomly excludes 20% of neurons in the layer
    model.add(Dropout(0.2))

    # A fully connected layer with 128 neurons
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(.2))

    # An output layer with softmax as in MLP
    model.add(Dense(1, activation='sigmoid'))
    adam_opt=Adam(learning_rate=.1)
    # Compile model as before in MLP
    model.compile(loss='binary_crossentropy', optimizer=adam_opt, metrics=['accuracy'])
    return model

In [54]:
model = NN_model()
# Fit the model
print(X_train.shape)
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=200, verbose=2)

# Evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("CNN Error: %.2f%%" % (100-scores[1]*100))



(800, 300)
Epoch 1/10
4/4 - 1s - loss: 0.8766 - accuracy: 0.7375 - val_loss: 0.5715 - val_accuracy: 0.7850 - 1s/epoch - 281ms/step
Epoch 2/10
4/4 - 0s - loss: 0.5343 - accuracy: 0.7862 - val_loss: 0.5404 - val_accuracy: 0.7850 - 37ms/epoch - 9ms/step
Epoch 3/10
4/4 - 0s - loss: 0.5328 - accuracy: 0.7812 - val_loss: 0.5218 - val_accuracy: 0.7850 - 33ms/epoch - 8ms/step
Epoch 4/10
4/4 - 0s - loss: 0.5364 - accuracy: 0.7850 - val_loss: 0.5207 - val_accuracy: 0.7850 - 33ms/epoch - 8ms/step
Epoch 5/10
4/4 - 0s - loss: 0.5191 - accuracy: 0.7862 - val_loss: 0.5186 - val_accuracy: 0.7850 - 37ms/epoch - 9ms/step
Epoch 6/10
4/4 - 0s - loss: 0.5184 - accuracy: 0.7862 - val_loss: 0.5177 - val_accuracy: 0.7850 - 35ms/epoch - 9ms/step
Epoch 7/10
4/4 - 0s - loss: 0.5225 - accuracy: 0.7862 - val_loss: 0.5094 - val_accuracy: 0.7850 - 33ms/epoch - 8ms/step
Epoch 8/10
4/4 - 0s - loss: 0.5088 - accuracy: 0.7862 - val_loss: 0.5050 - val_accuracy: 0.7850 - 34ms/epoch - 8ms/step
Epoch 9/10
4/4 - 0s - loss: 0

# Challenge: Optimizing a Neural Net


The logit model from the challenge question in NLP section used to classify the customer complaint data had an accuracy of 78.5%. What is the accuracy of the first neural network model on the same data? Hint: (read the output) Try changing the model to improve accuracy. What configuration gave you the best results? Try changing the parameters of the existing layers, or adding more layers.

In [55]:
word2vec_features_df=pd.read_csv('https://github.com/dlab-berkeley/Computational-Social-Science-Training-Program/raw/main/data/embeddings.csv').iloc[:, 1:]
y=pd.read_csv('https://github.com/dlab-berkeley/Computational-Social-Science-Training-Program/raw/main/data/y.csv')
y_vals=y['Product_binary'].values
X_train, X_test, y_train, y_test = train_test_split(word2vec_features_df,
                                                    y_vals,
                                                    train_size = .80,
                                                    test_size=0.20,
                                                    random_state = 10)

In [56]:
#original model

def NN_model():
    # create model
    model = Sequential()

    # A fully connected layer with 128 neurons
    model.add(Dense(128, input_dim=300,activation='relu'))

    # A dropout layer that randomly excludes 20% of neurons in the layer
    model.add(Dropout(0.2))

    # An output layer with binary classification
    model.add(Dense(1, activation='sigmoid'))

    # Compile model with crossentropy
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

    model = NN_model()

# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, verbose=0)

# Evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)

print("NN Error: %.2f%%" % (100-scores[1]*100))
print(model.summary())

NN Error: 21.50%
Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_24 (Dense)            (None, 128)               38528     
                                                                 
 dropout_15 (Dropout)        (None, 128)               0         
                                                                 
 dense_25 (Dense)            (None, 128)               16512     
                                                                 
 dropout_16 (Dropout)        (None, 128)               0         
                                                                 
 dense_26 (Dense)            (None, 1)                 129       
                                                                 
Total params: 55169 (215.50 KB)
Trainable params: 55169 (215.50 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

In practice we often take advantage of existing code and architecture to help accomplish deep learning tasks. This can range from taking a neural network architecture and adapting it to new data (as in our exercise above) to using other packages with pre-trained models. In the next section we will explore one such package called Huggingface.

# Huggingface

Objectives:
- Explore tasks and data available in Huggingface transformers
- Choose an appropriate language task
- Implement a transformer on local data

20 minutes

In reality, these models  require significant data and computational power, which can exceed the resources available to the analyst. We can circumvent this problem by using pre-trained models. Like a pre-trained embedding model, pre-trained models are trained on a large dataset. While this may not perfectly align with the data or task you have, it can help create a more robust system that can be fine-tuned to your data and goals.

[Huggingface](https://huggingface.co/models) is a set of pretrained models from a variety of datasets and sources with an easy-to-use interface. In this section, we will explore the use of the Huggingface library to streamline language task processing.



In [57]:
#install the transformers library
!pip install transformers



The simplest strategy is to use the pipeline method, where you select the task and the pre-trained model (there are multiple models available for many of the tasks)

In [58]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


The key to using these models, since the preprocessing is built in, is understanding the format of the data necessary for the model. This model takes the raw text as input rather than the word embeddings, so let's reload our data appropriately.

In [59]:
cfpb=pd.read_csv('https://raw.githubusercontent.com/dlab-berkeley/Computational-Social-Science-Training-Program/main/data/CFPB%202020%20Complaints.csv')
complaints=cfpb['Consumer complaint narrative']
complaints=complaints[~complaints.isna()]
classifier(complaints.values[0])
print(complaints.values[0])

Reviewed my credit report in XX/XX/XXXX and noticed a lot of errors, inconsistent, and incorrect information. Sent a letter to Equifax on XX/XX/XXXX via mail asking them for an investigation and to verify all the dates and amounts were correct and fix the incorrect reporting on my credit. They did not respond at all so I sent another letter on XX/XX/XXXX via mail, again asking for an investigation and proof. They still didnt respond to that letter so I sent a third letter on XX/XX/XXXX certified mail so I have proof that they signed for my letter.

Last week I received two letters from Equifax dated XX/XX/XXXX on the same day. The said that they could not locate my credit file and needed me to send proof of identification and address. With all three letters I sent a copy of my Arizona drivers license and my XXXX direct deposit sub as my proof of address. The second letter said that they received my request to be removed from the promotions list and that it was added to my credit file. 

Then use the pipeline on the example data, and look at the results.

In [60]:
for k in range(10):
  print(complaints.values[k])
  print(classifier(complaints.values[k]))

Reviewed my credit report in XX/XX/XXXX and noticed a lot of errors, inconsistent, and incorrect information. Sent a letter to Equifax on XX/XX/XXXX via mail asking them for an investigation and to verify all the dates and amounts were correct and fix the incorrect reporting on my credit. They did not respond at all so I sent another letter on XX/XX/XXXX via mail, again asking for an investigation and proof. They still didnt respond to that letter so I sent a third letter on XX/XX/XXXX certified mail so I have proof that they signed for my letter.

Last week I received two letters from Equifax dated XX/XX/XXXX on the same day. The said that they could not locate my credit file and needed me to send proof of identification and address. With all three letters I sent a copy of my Arizona drivers license and my XXXX direct deposit sub as my proof of address. The second letter said that they received my request to be removed from the promotions list and that it was added to my credit file. 

As you might expect, the complaints dataset has mostly negative values. While this is somewhat of a trivial example, it highlights how in just a few lines of code and no preprocessing we can implement a model on our own data. While this doesn't work for every task, for example the specific classification task that we were working with above, this is a valuable and powerful tool for quick, out-of-the-box models that don't take very long to initialize and tune.

# Large Language Model
We can use a large language model such as GPT-2 or GPT-Neo with the Hugging Face Transformers library. In this example, we will generate text using the GPT-2 model with the Hugging Face Transformers library. This example will use the code below to produce five different continuations of the provided input text.

We will use pipeline function. The pipeline function is a simple way to create a model with its associated preprocessing and postprocessing steps.  It's a high-level function provided by the Hugging Face Transformers library that makes it easy to load a pre-trained model with its associated tokenizer and use it for various tasks like text generation, sentiment analysis, and more. Using the pipeline function is not strictly necessary, but it does simplify the code by abstracting away several underlying details.

In [61]:
from transformers import pipeline, set_seed
# Here, 'text-generation' tells the pipeline to create a text generation model, and model='gpt2'
# specifies that you want to use the GPT-2 model for this task.
generator = pipeline('text-generation', model='gpt2')

In [62]:
set_seed(100)
input_text = "I study sociology at University of California"
# specifying the maximum length of the output, and the number of different sequences you want
output = generator(input_text, max_length=50, num_return_sequences=5)
for result in output:
    print(result['generated_text'])
    print("----")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I study sociology at University of California-Davis, U.S.A. and are working on a thesis on the relationship between gender roles and college students with ADHD. This article will discuss the research and current policy that has enabled me to use the
----
I study sociology at University of California Irvine. She writes:

"As a sociology major, I have a passion for learning, both for people and to make research decisions." A graduate student, she holds a graduate degree in anthropology from University of
----
I study sociology at University of California, Santa Barbara, where I worked as the director of the student-run "Papers on the Web."

I found myself talking to people who were living in the real world at a time when their voices
----
I study sociology at University of California San Francisco, and now I'm looking for a new job. If you're doing more than just study sociology, get me ready to apply here! (If not, I'll do my best to send you my
----
I study sociology at University of C

# Challenge
We can experiment changing different hyper parameters in the model above. Experiment by specifying different values for *temperature* parameter in generator function.
<br>
Temperature regulates the unpredictability of a language model's output. With higher temperature settings, outputs become more creative and less predictable as it amplifies the likelihood of less probable tokens while reducing that for more probable ones. A temperature value that is higher, such as 1.7, encourages the generation of text that is more varied and imaginative. Conversely, a lower temperature value, like 0.7, steers the output towards more concentrated and predictable text.

In [63]:
output = generator(input_text, max_length=50, num_return_sequences=5, temperature = 0.7)
for result in output:
    print(result['generated_text'])
    print("----")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I study sociology at University of California, Berkeley, and I'm interested in what it means for people to become social scientists and also to be part of the social sciences. I'm interested in how we understand people's social roles in society and how they
----
I study sociology at University of California, Berkeley.

The author is a member of the Board of Trustees of the University of California, Davis.
----
I study sociology at University of California, Oakland. I am an undergraduate in psychology at the University of California, Berkeley, and work for the Department of Social Work at the University of California, Berkeley.

My undergraduate years were primarily spent studying the
----
I study sociology at University of California, Berkeley. I'm fascinated by the nature of social change and the ways it's brought about in our society. How does the world affect you? When does social change affect your daily life?

I'm
----
I study sociology at University of California, Irvine. In the 

We used GPT2 to generate text above because it is free and runs relatively faster to be able to demonstrate in a classroom setting.  Sometimes, using a larger and more powerful model such as GPT-3 instead of GPT-2 can improve the quality of the generated text. Larger models often capture nuances and subtleties better. However, larger models often take longer time to run and some of the newest models are only available through a paid API.

## Challenge

Let's practice with another task from [huggingface](https://huggingface.co/docs/transformers/task_summary).

Let's say we want to check our data for grammatical correctness. We will use the CoLA model ("textattack/distilbert-base-uncased-CoLA") in the Text Classification pipeline ('text-classification') What is the grammatical correctness of each of the first 15 entries in the cfpb dataset?


In [64]:
#solution
classifier = pipeline("text-classification", model = "textattack/distilbert-base-uncased-CoLA")
classifier("I went to the bus.")


[{'label': 'LABEL_1', 'score': 0.9881684184074402}]

There are thousands of models on huggingface that can be used for a variety of language tasks. This can be a great way to use the models already available to increase our modeling power.

# Next Steps

This lab has introduced Colab as a way to use GPUs to speed up processing power and explored further applications of deep learning to natural language processing.

In practice, using deep learning for computational social science requires building on the foundational concepts covered in this notebook to implement models with more complicated data and architecture. However, there are many strategies can help you navigate the model ecosystem, some of which we will discuss here:

1. Documentation (and other resources like tutorials) is a goldmine of information for implementing particular algorithms and completing specific tasks. This is one reason why reading and translating code written by others is a key skill.

2. Debugging and interpreting error messages, as well as leveraging online resources in order to resolve them, is another key concept. Resources like documentation and Stack Overflow help solve common errors and get code working faster.

3. Computational resources are important for running complex models. Google Colab has access to GPUs, but does have limitations for large and extended jobs. In those cases, options are paid services such as Google Colab Pro or on-campus [resources](https://docs-research-it.berkeley.edu/services/high-performance-computing/overview/).

4. Further resources:
  - Deep Learning with Python (Francois Chollet)
  - [Huggingface course](https://huggingface.co/course/chapter1/1)
  - [Tensorflow](https://www.tensorflow.org/tutorials)




