<a href="https://colab.research.google.com/github/WoozieFR/test2/blob/main/notebooke6f8369de9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [33]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'pii-detection-removal-from-educational-data:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-competitions-data%2Fkaggle-v2%2F66653%2F7500999%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240206%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240206T103910Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D8c837cef8d9b761597ff100f92d3b4ac9af7b1bb78cb1107bbfad5c8d9faf8ffc5350e4c5645e21afab015568236ddbb8aa0f37100b78efe59ef63ddc7590f030b90a0d9d6eb52edafd3ec9ac8bc795e0d7333ef9495a44c7585b28c2ade77e34b940a010073cd5d75ee95c4f48e1b92a5b4350d48073bb4accd247221d878a0639cfb2237212e3cc256621ba01b5d4db4223e73c6be972a3ef03f5119f51d4a534e4e5514454927e9b7741cc9a4f50bc226e71bcea5ba44e0da745ffb5d9111281623a43b400a5970a33f9cb1a94a40689b7131939dab18f7542e6f2d7e4c38ae3611e662eb8ae8c5474b2642a1ab37afd02f48ce78254485bb72eb7a607e93'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              # Use a distinct name so the `tarfile` module is not shadowed on later iterations
              with tarfile.open(tfile.name) as tar:
                tar.extractall(destination_path)
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


Downloading pii-detection-removal-from-educational-data, 22403094 bytes compressed
Downloaded and uncompressed: pii-detection-removal-from-educational-data
Data source import complete.


## Introduction

One area where deep learning has dramatically improved in the last couple of years is natural language processing (NLP). Computers can now generate text, translate automatically from one language to another, analyze comments, label words in sentences, and much more.

Perhaps the most widely useful practical application of NLP is *classification* -- that is, classifying a document automatically into some category. This can be used, for instance, for:

- Sentiment analysis (e.g., are people saying *positive* or *negative* things about your product)
- Author identification (what author most likely wrote some document)
- Legal discovery (which documents are in scope for a trial)
- Organizing documents by topic
- Triaging inbound emails
- ...and much more!

Classification models can also be used to solve problems that are not, at first, obviously appropriate. For instance, consider the Kaggle [U.S. Patent Phrase to Phrase Matching](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/) competition. In this, we are tasked with comparing two words or short phrases, and scoring them on how similar they are, given the patent class they were used in. A score of `1` means the two inputs have identical meaning, and `0` means they have totally different meanings. For instance, *abatement* and *eliminating process* have a score of `0.5`, meaning they're somewhat similar, but not identical.

It turns out that this can be represented as a classification problem. How? By representing the question like this:

> For the following text...: "TEXT1: abatement; TEXT2: eliminating process" ...choose a category of meaning similarity: "Different; Similar; Identical".

In this notebook we'll see how to solve the Patent Phrase Matching problem by treating it as a classification task, by representing it in a very similar way to that shown above.
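
The framing above can be sketched in a few lines. This is only an illustration: the helper name `format_pair` and the `CATEGORIES` list are assumptions made for this example, and the prompt wording simply mirrors the quoted text.

```python
# Illustrative sketch: `format_pair` and `CATEGORIES` are hypothetical names.
CATEGORIES = ["Different", "Similar", "Identical"]

def format_pair(text1: str, text2: str) -> str:
    """Combine two phrases into a single string a classifier can read."""
    return f"TEXT1: {text1}; TEXT2: {text2}"

example = format_pair("abatement", "eliminating process")
print(example)
```

A classifier would then be trained to map each such string to one of the categories.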

### On Kaggle

Kaggle is an awesome resource for aspiring data scientists or anyone looking to improve their machine learning skills. There is nothing like being able to get hands-on practice and receiving real-time feedback to help you improve your skills. It provides:

1. Interesting data sets
1. Feedback on how you're doing
1. A leaderboard to see what's good, what's possible, and what's state of the art
1. Notebooks and blog posts by winning contestants that share useful tips and techniques.

The dataset we will be using here is only available from Kaggle. Therefore, you will need to register on the site, then go to the [page for the competition](https://www.kaggle.com/c/us-patent-phrase-to-phrase-matching). On that page click "Rules," then "I Understand and Accept." (Although the competition has finished, and you will not be entering it, you still have to agree to the rules to be allowed to download the data.)

There are two ways to then use this data:

- Easiest: run this notebook directly on Kaggle, or
- Most flexible: download the data locally and run it on your PC or GPU server

If you are running this on Kaggle.com, you can skip the next section. Just make sure that on Kaggle you've selected to use a GPU during your session, by clicking on the menu (the three dots in the top right) and clicking "Accelerator" -- it should look like this:

![image.png](attachment:9af4e875-1f2a-468c-b233-8c91531e4c40.png)

We'll need slightly different code depending on whether we're running on Kaggle or not, so we'll use this variable to track where we are:

In [34]:
import os
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
from pathlib import Path

### Using Kaggle data on your own machine

Kaggle limits your weekly time using a GPU machine. The limits are very generous, but you may well still find it's not enough! In that case, you'll want to use your own GPU server, or a cloud server such as Colab, Paperspace Gradient, or SageMaker Studio Lab (all of which have free options). To do so, you'll need to be able to download Kaggle datasets.

The easiest way to download Kaggle datasets is to use the Kaggle API. You can install this using `pip` by running this in a notebook cell:

    !pip install kaggle

You need an API key to use the Kaggle API; to get one, click on your profile picture on the Kaggle website, and choose My Account, then click Create New API Token. This will save a file called *kaggle.json* to your PC. You need to copy this key to your GPU server. To do so, open the file you downloaded, copy the contents, and paste them in the following cell (e.g., `creds = '{"username":"xxx","key":"xxx"}'`):

In [35]:
creds = ''

Then execute this cell (this only needs to be run once):

In [36]:
# for working with paths in Python, I recommend using `pathlib.Path`
from pathlib import Path

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
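
Since an empty or mistyped `creds` string would be written to disk as-is, it can be worth checking it first. Here is a minimal sketch, assuming the file only needs the `username` and `key` fields shown in the example above; `check_creds` is a hypothetical helper, not part of the Kaggle API.

```python
import json

def check_creds(creds: str) -> dict:
    """Parse a kaggle.json string and confirm it has the expected keys."""
    # json.loads raises json.JSONDecodeError (a ValueError) on empty or malformed input
    parsed = json.loads(creds)
    if not isinstance(parsed, dict):
        raise ValueError("kaggle.json should contain a JSON object")
    missing = {"username", "key"} - parsed.keys()
    if missing:
        raise ValueError(f"kaggle.json is missing: {sorted(missing)}")
    return parsed

print(check_creds('{"username":"xxx","key":"xxx"}')["username"])
```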

Now you can download datasets from Kaggle.

In [37]:
path = Path('us-patent-phrase-to-phrase-matching')

And use the Kaggle API to download the dataset to that path, and extract it:
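
A minimal sketch of that step follows. It assumes the `kaggle` package is installed and that `kaggle.api.competition_download_cli` saves `<competition>.zip` into the current directory, which is worth confirming against your installed version. The helper names `download_competition` and `extract_archive` are made up for this example.

```python
import zipfile
from pathlib import Path

def download_competition(name: str) -> Path:
    """Download a competition archive (needs Kaggle credentials and network access)."""
    import kaggle  # imported lazily, since importing it requires ~/.kaggle/kaggle.json
    kaggle.api.competition_download_cli(name)
    return Path(f"{name}.zip")  # assumed archive name

def extract_archive(archive: Path, dest: Path) -> list:
    """Extract a zip archive into `dest` and return the sorted member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return sorted(zf.namelist())

# Example usage (not run here, because it needs credentials):
# if not iskaggle and not path.exists():
#     extract_archive(download_competition(str(path)), path)
```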

Note that you can easily download notebooks from Kaggle and upload them to other cloud services. So if you're low on Kaggle GPU credits, give this a try!

## Import and EDA

In [39]:
if iskaggle:
    path = Path('../input/pii-detection-removal-from-educational-data')
    ! pip install -q datasets

In [40]:
path = Path('../input/pii-detection-removal-from-educational-data')
! pip install -q datasets

Documents in NLP datasets are generally in one of two main forms:

- **Larger documents**: One text file per document, often organised into one folder per category
- **Smaller documents**: One document (or document pair, optionally with metadata) per row in a [CSV file](https://realpython.com/python-csv/).

Let's look at our data and see what we've got. In Jupyter you can use any bash/shell command by starting a line with a `!`, and use `{}` to include Python variables, like so:

In [41]:
!ls {path}

sample_submission.csv  test.json  train.json
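
If you'd rather stay in Python (for instance, on a system without `ls`), `pathlib` can produce the same listing. This is a small sketch; `list_files` is a hypothetical helper name.

```python
from pathlib import Path

def list_files(folder) -> list:
    """Return the names of the entries in `folder`, sorted alphabetically."""
    return sorted(p.name for p in Path(folder).iterdir())

# Example usage: list_files(path)
```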


It looks like this competition provides its training and test data as JSON files, with a CSV file for the sample submission. For opening, manipulating, and viewing tabular data like this, it's generally best to use the Pandas library, which is explained brilliantly in [this book](https://wesmckinney.com/book/) by the lead developer (it's also an excellent introduction to matplotlib and numpy, both of which I use in this notebook). Generally it's imported as the abbreviation `pd`.

In [42]:
import pandas as pd
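
Before loading the file, it may help to see how the per-token fields in this dataset line up. The sketch below assumes each record carries parallel `tokens`, `trailing_whitespace`, and `labels` lists (as the DataFrame displayed further down suggests), and that a `True` in `trailing_whitespace` means a single space follows the token. The helper names are made up for this example, and the sample record is a short excerpt from the training data.

```python
def rebuild_text(tokens, trailing_whitespace):
    """Join tokens back into text, adding a space where the flag is True."""
    return "".join(tok + (" " if ws else "")
                   for tok, ws in zip(tokens, trailing_whitespace))

def labelled_tokens(tokens, labels):
    """Return (token, label) pairs for every token not labelled 'O'."""
    return [(tok, lab) for tok, lab in zip(tokens, labels) if lab != "O"]

record = {
    "tokens": ["Diego", "Estrada", "\n\n", "Design", "Thinking", "Assignment"],
    "trailing_whitespace": [True, False, False, True, True, False],
    "labels": ["B-NAME_STUDENT", "I-NAME_STUDENT", "O", "O", "O", "O"],
}

print(repr(rebuild_text(record["tokens"], record["trailing_whitespace"])))
print(labelled_tokens(record["tokens"], record["labels"]))
```

The `B-` and `I-` prefixes mark the beginning and the continuation of a labelled span, so the two name tokens here form a single `NAME_STUDENT` entity.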

Let's load the training data:

In [43]:
df = pd.read_json(path/'train.json')

This creates a [DataFrame](https://pandas.pydata.org/docs/user_guide/10min.html), which is a table of named columns, a bit like a database table. To view the first and last rows, and row count of a DataFrame, just type its name:

In [44]:
df

Unnamed: 0,document,full_text,tokens,trailing_whitespace,labels
0,7,"Design Thinking for innovation reflexion-Avril 2021-Nathalie Sylla\n\nChallenge & selection\n\nThe tool I use to help all stakeholders finding their way through the complexity of a project is the mind map.\n\nWhat exactly is a mind map? According to the definition of Buzan T. and Buzan B. (1999, Dessine-moi l'intelligence. Paris: Les Éditions d'Organisation.), the mind map (or heuristic diagram) is a graphic representation technique that follows the natural functioning of the mind and allows the brain's potential to be released. Cf Annex1\n\nThis tool has many advantages:\n\n• It is accessible to all and does not require significant material investment and can be done quickly\n\n• It is scalable\n\n• It allows categorization and linking of information\n\n• It can be applied to any type of situation: notetaking, problem solving, analysis, creation of new ideas\n\n• It is suitable for all people and is easy to learn\n\n• It is fun and encourages exchanges\n\n• It makes visible the dimension of projects, opportunities, interconnections\n\n• It synthesizes\n\n• It makes the project understandable\n\n• It allows you to explore ideas\n\nThe creation of a mind map starts with an idea/problem located at its center. This starting point generates ideas/work areas, incremented around this center in a radial structure, which in turn is completed with as many branches as new ideas.\n\nThis tool enables creativity and logic to be mobilized, it is a map of the thoughts.\n\nCreativity is enhanced because participants feel comfortable with the method.\n\nApplication & Insight\n\nI start the process of the mind map creation with the stakeholders standing around a large board (white or paper board). In the center of the board, I write and highlight the topic to design.\n\nThrough a series of questions, I guide the stakeholders in modelling the mind map. I adapt the series of questions according to the topic to be addressed. 
In the type of questions, we can use: who, what, when, where, why, how, how much.\n\nThe use of the “why” is very interesting to understand the origin. By this way, the interviewed person frees itself from paradigms and thus dares to propose new ideas / ways of functioning. I plan two hours for a workshop.\n\nDesign Thinking for innovation reflexion-Avril 2021-Nathalie Sylla\n\nAfter modelling the mind map on paper, I propose to the participants a digital visualization of their work with the addition of color codes, images and interconnections. This second workshop also lasts two hours and allows the mind map to evolve. Once familiarized with it, the stakeholders discover the power of the tool. Then, the second workshop brings out even more ideas and constructive exchanges between the stakeholders. Around this new mind map, they have learned to work together and want to make visible the untold ideas.\n\nI now present all the projects I manage in this type of format in order to ease rapid understanding for decision-makers. These presentations are the core of my business models. The decision-makers are thus able to identify the opportunities of the projects and can take quick decisions to validate them. They find answers to their questions thank to a schematic representation.\n\nApproach\n\nWhat I find amazing with the facilitation of this type of workshop is the participants commitment for the project. This tool helps to give meaning. The participants appropriate the story and want to keep writing it. Then, they easily become actors or sponsors of the project. 
A trust relationship is built, thus facilitating the implementation of related actions.\n\nDesign Thinking for innovation reflexion-Avril 2021-Nathalie Sylla\n\nAnnex 1: Mind Map Shared facilities project\n\n","[Design, Thinking, for, innovation, reflexion, -, Avril, 2021, -, Nathalie, Sylla, \n\n, Challenge, &, selection, \n\n, The, tool, I, use, to, help, all, stakeholders, finding, their, way, through, the, complexity, of, a, project, is, the, , mind, map, ., \n\n, What, exactly, is, a, mind, map, ?, According, to, the, definition, of, Buzan, T., and, Buzan, B., (, 1999, ,, Dessine, -, moi, , l'intelligence, ., Paris, :, Les, Éditions, d'Organisation, ., ), ,, the, mind, map, (, or, heuristic, diagram, ), is, a, graphic, , representation, technique, that, follows, the, natural, functioning, of, the, mind, and, allows, the, brain, ...]","[True, True, True, True, False, False, True, False, False, True, False, False, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, False, False, False, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, False, False, True, False, False, True, False, False, True, False, True, True, True, False, False, False, True, True, True, True, False, True, True, False, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, False, ...]","[O, O, O, O, O, O, O, O, O, B-NAME_STUDENT, I-NAME_STUDENT, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...]"
1,10,"Diego Estrada\n\nDesign Thinking Assignment\n\nVisualization Tool\n\nChallenge & Selection\n\nThe elderly were having a hard time adapting to the changes we brought in our bank. As a result of a poorly implemented linear solution, a more customer centric approach was needed.\n\nAfter learning about design thinking in this course, we decided to apply it to solve this problem. The visualization tool allowed the team to create a dynamic presentation using diagrams, figures and drawings on the go that really resonated among the stakeholders. Previous to this change, none of our solutions seemed to be adequate for them, but the new implementation created a different type of connection with them that helped them understand the problem in the way the team and I did.\n\nApplication\n\nThe process starts in the prep time. The team uses a series of tools and software to develop a presentation using the surveys gathered during research and the solutions we created during the process. The use of graphs to quickly show statistics in a fully visual way, rather than verbally was a game changer.\n\nAfter having a presentation prepared, the team hands an activity to the stakeholders, where the solutions discussed previously appear. Nonetheless, the solutions need more work to them. After this. The stakeholders are asked to help complete the solutions while the team and I create diagrams on a blackboard to represent how their suggestions would impact on this specific problem.\n\nThe use of a group activity strengthens the bond between the company and their investors. It makes them feel like they take part and help solve the problems as well as show how customer centric the solutions are. Every complaint and suggestion from customers are read and evaluated using the graph shown in the course (Involving: can we do it? Can we afford it? …). The finalization of this activity leaves the team and the stakeholders on the same page. 
It allows them to completely understand and feel part of the solution and also gives them the chance to ask better questions, which eases the work of the team.\n\nInsight & Approach\n\nThe use of this method created a new workflow in the Design Team. It increased the productivity and the success rate as well as the customer/stakeholders satisfaction. The use of the visualization tool created an engaged group of people who work together to\n\nDiego Estrada\n\nfind a solution based on their customer satisfaction. This solution is later revised and tweaked with the help of the stakeholders who are deeply involved in the process.\n\nPresentations, graphics, and activities have added a huge increase in satisfaction. As a company we also learnt that engaging different areas can be difficult because of the varying levels of understanding, but when paired with the adequate process things just flow.\n\n(This story is fictional and was created for solving the assignment)\n\n","[Diego, Estrada, \n\n, Design, Thinking, Assignment, \n\n, Visualization, Tool, \n\n, Challenge, &, Selection, \n\n, The, elderly, were, having, a, hard, time, adapting, to, the, changes, we, brought, in, our, bank, ., As, , a, result, of, a, poorly, implemented, linear, solution, ,, a, more, customer, centric, approach, was, , needed, ., \n\n, After, learning, about, design, thinking, in, this, course, ,, we, decided, to, apply, it, to, solve, this, , problem, ., The, visualization, tool, allowed, the, team, to, create, a, dynamic, presentation, using, , diagrams, ,, figures, and, drawings, on, the, go, that, really, resonated, among, the, stakeholders, ., ...]","[True, False, False, True, True, False, False, True, False, False, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, False, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, False, False, False, False, True, True, 
True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, False, True, ...]","[B-NAME_STUDENT, I-NAME_STUDENT, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...]"
2,16,"Reporting process\n\nby Gilberto Gamboa\n\nChallenge\n\nI received a promotion of being the Regional Controller, along with my actual position of\n\nCountry CFO. The main responsibility of this new position was to weekly report the results\n\nfor the week and estimate the final results of the month of 4 countries and consolidated\n\nthose.\n\nWhen I was receiving the position, I went to visit my colleague, former Regional Controller,\n\nwho was promoted to Country CEO and now had interest conflicts of being the controller.\n\nThe process to consolidate the information of the 4 countries was that the country controllers\n\nsent him an email with the main figures for the week, he forwarded those to his country\n\naccountant who consolidated it, the accountant sent him the consolidated report and he\n\nfinally reported to the headquarters. The whole process took almost a full business day to\n\ncomplete.\n\nGiven that my responsibilities as Country CFO demanded more attention because my\n\ncountry had more operations, I decided to change the process in order to reduce the\n\nduration and to ensure standardization in the format, and actually, reduce the human\n\nintervention, making that the country controllers work directly in the consolidation file.\n\nSelection\n\nHaving in mind that there was a different kind of users of file, I select some of those to\n\ndetermine what was the main important things to take into account in the moment of the\n\nprocess of the information and the reading of the same. In that sense, we form a group of\n\nthe country controllers, country CEOs, IT guys, and people from the headquarters to find the\n\nbest solutions possible.\n\nApplication\n\nFor the first lunch, we focused on the consolidation process in order to avoid the copy-paste\n\nprocesses and reducing the manual intervention so, we build an online application where all\n\nthe controllers fill the figure of their respective country, along with the comments. 
During the\n\nfirst week of the first stage, we sent the new report along with the old one, and after the\n\nmeeting with the headquarters team, we ask for a post-meeting review of the new format, all\n\nthe assistants provided their comments and suggestions that were the input for the next\n\nreport.\n\nFor the second lunch, we focused on the feedback received from the assistants to the review\n\nmeeting, we adjust the report and we were able to eliminate the old one. The final report\n\nincluded all the suggestions received but the best of all is that reduced the time investment\n\nfrom about 36 men hours to around 8, without missing any valuable information and\n\nincluding new data that the stakeholders appreciated so much.\n\nInsight\n\nWith the application of the learning launch tool, the controller’s team along with the main\n\nstakeholders identified different assumptions and designed tools to test these assumptions.\n\nOn the other hand, we found probable requirements from headquarters, expecting to find\n\nthat a more agile approach that improved the workflow, reduced the time investment of\n\neveryone in the team and that both our team and the key stakeholders were very satisfied\n\nwith the results of the exercise and the new report.\n\nThe final report was slightly different from what we anticipated, but the differences were\n\nmore related to form and a few topics to be included in the report.\n\nApproach\n\nDespite that, the team was not used to design thinking tools, they were able to work with the\n\nlearning launch that was the appropriate tool. 
The team needs to review the insight gained\n\nfrom our first two launches and continuously evaluate this insight and new ones into future\n\nlaunch designs, especially taking into account that the full automation of the reports will take\n\nat least 4 years more according to the ERP implementation plan of the headquarter.\n\n","[Reporting, process, \n\n, by, Gilberto, Gamboa, \n\n, Challenge, \n\n, I, received, a, promotion, of, being, the, Regional, Controller, ,, along, with, my, actual, position, of, \n\n, Country, CFO, ., The, main, responsibility, of, this, new, position, was, to, weekly, report, the, results, \n\n, for, the, week, and, estimate, the, final, results, of, the, month, of, 4, countries, and, consolidated, \n\n, those, ., \n\n, When, I, was, receiving, the, position, ,, I, went, to, visit, my, colleague, ,, former, Regional, Controller, ,, \n\n, who, was, promoted, to, Country, CEO, and, now, had, interest, conflicts, of, being, the, controller, ., \n\n, The, ...]","[True, False, False, True, True, False, False, False, False, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, False, False, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False, False, False, True, True, True, True, True, False, True, True, True, True, True, True, False, True, True, True, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False, True, ...]","[O, O, O, O, B-NAME_STUDENT, I-NAME_STUDENT, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...]"
3,20,"Design Thinking for Innovation\n\nSindy Samaca\n\nGitam University\n\nDecember, 2021\n\nChallenge\n\nMy challenge to solve is the problem of a company that is without\n\nsufficient capital to create and build its own industrial plant, and cannot\n\ncope with the growing demand that they have had in the last two months.\n\nThe company is in a dilemma, as it has two potential investors; The first\n\ninvestor offers the amount of capital necessary for the creation of the\n\nplant, in return he asks to intervene in the process and make certain\n\nchanges in the final product since he thinks that the idea he has has\n\nshown good results in previous years and to have the participation of 50%\n\nin the company, the second investor offers the capital necessary for the\n\ncreation of the plant in exchange for do not make modifications to the\n\nfinal product, since this is the one that is generating the exponential\n\ngrowth of the company, for the moment it is only asked to focus on the\n\ncompany's success product and as they recover the investment they can\n\nbe extended to the innovation of products and creation for a 20% stake\n\nwithin the company.\n\nAs a manager of the company, I need to make the best decision for the\n\nabove mentioned one.\n\nSelection\n\nI decided to choose the visualization tool, visualization allows me to\n\nunderstand and analyze the context, to be able to define the needs and\n\ntheir possible solutions. 
This tool is useful for me to have a clearer vision\n\nof the current state of the company and what it ultimately requires;\n\nthrough this tool I can deduce the personalities of the two investors, such\n\nas George and Geoff; and based on this I can make the best decision.\n\nApplication\n\nIn the application of my chosen tool to the challenge that I have set\n\nmyself, the tool has been of a very wide utility, since when applying it to\n\nmy challenge very good results were obtained, since with this I could have\n\nan analysis to the measure of the two investors. The first thing I did with\n\nmy visualization tool was to conduct a study to customers who required\n\nthe flagship product of the company, first I had to understand why the\n\ndemand had had such a strong growth in the last two months, by applying\n\nthe tool I could find the answers to this question; I could see that\n\ncustomers needed this product as it was, without any modification\n\nbecause the product worked wonders.\n\nWith this study of the clients interested in the product, I was able to\n\nanalyze and follow up on the proposals I had on the table from the two\n\ninvestors.\n\nI concluded that I needed the person who was interested in the first\n\ninstance to supply the product that was having great success. Which was\n\nthe second investor, since he was interested in first of all to give the size\n\nof production that our customers demanded, and then worry about\n\nimproving it, but in small and concise steps.\n\nI discarded the first investor, as he did not meet the requirements that the\n\ncompany needed at that time. 
I realized that this investor was convinced\n\nthat by changing the product to the way he thought, the demand was\n\ngoing to increase more; but in reality what the company needed was to be\n\nable to cover the demand that the product was having, this investor\n\nwanted to throw everything we had, to the side of the road, but he did not\n\nthink for a moment that these modifications should have a low impact in\n\ncase things do not go as well as expected.\n\nThanks to the application of the tool I was able to identify the key points\n\nof the company's needs and how each investor, according to their\n\nproposals, could help me, based on this I made the best decision. I chose\n\nto work with the second investor, we were able to supply the demand for\n\nthe product, then the investment tripled and we were able to inject it into\n\nresearch for product innovation, and make some small changes, which\n\nwere being introduced to customers discreetly and in small quantities.\n\nWhen customers became more interested in the improved product, we\n\ndecided to increase production by improving the product.\n\nPerspective\n\nThe information and the final lesson that I had when socializing what we\n\nobtained by using the visualization tool, the first thing is that we must\n\nalways think first about the needs of customers, of the people interested\n\nin the product; it is very useful to know what the problem is, what is the\n\ncause of the problem and finally build an effective solution. Secondly,\n\nbased on what we have obtained, we can find a partner that fits our\n\nneeds, and be able to work as a team, unifying ideas and producing\n\npossible solutions. 
Thanks to the knowledge gained from the design\n\ncourse, I was able to create a strategic plan for the company's problems\n\nand make it a successful plan.\n\nApproach\n\nWhat I would apply differently in the challenge would be the tool used, I\n\nwould like to use the learning launch; I would do it in the following way,\n\nrejecting the proposals of the two investors and only creating the way that\n\nthe production process would be decreased to be able to supply the\n\nrequired demand. It may or may not work out, but I would like to have\n\nemployed this strategy.\n\n","[Design, Thinking, for, Innovation, \n\n, Sindy, Samaca, \n\n, Gitam, University, \n\n, December, ,, , 2021, \n\n, Challenge, \n\n, My, challenge, to, solve, is, the, problem, of, a, company, that, is, without, \n\n, sufficient, capital, to, create, and, build, its, own, industrial, plant, ,, and, can, not, \n\n, cope, with, the, growing, demand, that, they, have, had, in, the, last, two, months, ., \n\n, The, company, is, in, a, dilemma, ,, as, it, has, two, potential, investors, ;, The, first, \n\n, investor, offers, the, amount, of, capital, necessary, for, the, creation, of, the, \n\n, plant, ,, in, return, he, asks, to, ...]","[True, True, True, False, False, True, False, False, True, False, False, False, True, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, False, False, True, True, True, True, True, True, True, True, True, False, True, True, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False, True, True, True, True, True, False, True, True, True, True, True, True, False, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, False, False, False, True, True, True, True, True, True, ...]","[O, O, O, O, O, B-NAME_STUDENT, I-NAME_STUDENT, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, 
O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...]"
4,56,"Assignment: Visualization Reflection Submitted by: Nadine Born Course: Design Thinking for Innovation Trail Challenge: To Build or Not to Build An environmental charity wanted to conduct a fundraising campaign to raise $4 million to build a public path in a busy tourist area of a small town in British Columbia, Canada. They had been gifted a large piece of land by a local landowner, which was a substantial gift and prevented them from needing to purchase the land, however, they still needed to raise a large amount of money in order to pay for the supplies and labor to build the trail. Even though the local community appeared to be supportive of the trail, they could not provide enough money from private donations to build it. If the summer vacation property owners did not provide some funding, then there was a strong possibility that they would not raise enough money to complete the trail. The charity did not know if the community as a whole would support the project and needed to conduct testing with key influencers and potential donors to gauge their interest. Building the trail without testing the support first was too risky because the charity did not have enough money in reserve to cover the cost of the trail if the fundraising efforts were not successful. Tool Selection: Visualization Visualization is the process of “assembling scattered ideas into a compelling story that can generate vivid mental images” (Designing for Growth, p49). As the consultant for the study, I chose visualization because the charity had a firm concept of why they needed the trail, how it would benefit the town, and how much it would cost but needed a persuasive way to tie it all together. The business case for the project was strong but without a tool to help them illustrate how the trail would positively impact the residents, there was little chance people would donate enough to meet the budget. 
We needed a tool that provided a “head and heart” message to convince people to support the project. Visualization provided the perfect combination of key messaging, beautiful photography, architectural renderings, safety data, and budget criteria to create the vision for the project in an easy‐to‐read document that was only four pages in length. Visualization allowed us to describe the urgent and compelling need for the trail in a succinct and tangible way. Application Once we drafted the vision document, we worked with the charity to identify a list of people whose opinion would be important to the success (or failure) of the fundraising campaign. The list included past and potential donors, key influencers in the community such as large landowners and business owners, affluent summer‐only residents, and elected officials. We requested one‐hour meetings with all of the people on the list. If people did not want to meet with us in person, which was often the\n\ncase with the part‐time residents, we offered to conduct the meetings by phone. When someone agreed to meet with us, we emailed them the vision document so they could read it in advance and prepare their questions. This created a good environment for an informed and candid dialogue. While the scheduling of the interviews was in progress, we designed a questionnaire to guide our discussions. Consistently using the questionnaire ensured that we covered the same questions with all the interviewees. The goal was to speak with 20 – 25 key influencers in the community and gauge their interest in, or opposition to, supporting the fundraising efforts for the trail as either donors or campaign volunteers or both. We successfully met with 24 interviewees and compiled the feedback into a summary report along with recommendations for the charity. The entire process took three months. Insight Fundraising and design thinking both require a willingness to adapt and fail fast. 
Good fundraisers are responsive to their donors and design thinking serves as the perfect platform to plan and launch new fundraising initiatives; it is an ideal methodology for solving complex philanthropic issues. We are not formally taught design thinking models in fundraising classes but they should be added to the curriculum. Although it is not explicitly stated in the course videos, it occurred to me that both fundraising and design thinking are rooted in communication and relationships and both are iterative processes based on testing and investigation. The elusive synergy between art and science is as beautifully illustrated by design thinking as it is as inherent in daily fundraising practice. Improvements for Next Time Visualization was an effective tool in this circumstance and I would use it again in a similar situation. However, it would also be enlightening to create a journey map with a wide spectrum of people from the town because there were many assumptions made during the visualization process about how the trail would be used and what benefits would be desired by the tourists, local public, and sponsors. Journey mapping would have confirmed the validity of those assumptions and supplemented the data gathered during the interviews. The other tool that I would use next time in conjunction with visualization is storytelling. I would take someone’s first hand account of the situation and create a short video to showcase the urgency of the project or program from his/her perspective. A link to the video would be included in the email about the interview along with a PDF of the vision document. 
I think this would provide a holistic micro and macro perspective to the exercise and spark some interesting conversations in the interviews.\n\n","[Assignment, :, , Visualization, , Reflection, , Submitted, , by, :, , Nadine, Born, , Course, :, , Design, , Thinking, , for, , Innovation, , Trail, , Challenge, :, , To, , Build, , or, , Not, , to, , Build, , An, , environmental, , charity, , wanted, , to, , conduct, , a, , fundraising, , campaign, , to, , raise, , $, 4, , million, , to, , build, , a, , public, , path, , in, , a, , busy, , tourist, , area, , of, , a, , small, , town, , in, , ...]","[False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, ...]","[O, O, O, O, O, O, O, O, O, O, O, O, B-NAME_STUDENT, I-NAME_STUDENT, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...]"
...,...,...,...,...,...
6802,22678,"EXAMPLE – JOURNEY MAP\n\nTHE CHALLENGE My wife owns a business that sells health and beauty products. As a result of the codvid pandemic and the restrictions that are in place until now, sales fell. delivery personnel had to be fired and operations slowed down. There are restrictions for the sale of products in store. Although deliveries are made at home and at no additional cost, sales do not go up. The option to reduce the price of the product was not an option, there had been product promotions without great variations in sales THE SELECTION We provide solutions through the design thinking method with the aim of increasing sales. but lacking quantitative information, we opted for a journey map as a tool. We carry out the research with regular customers of the products. The journey map was done with 4 costumers. THE APPLICATION For the application of the jurney map we contacted frequent clients, who through interviews via ZOOM could not be face-to-face due to bio-security measures, key questions were asked and they were used to show the benefits of the products. The discoveries when talking directly with our clients generated several hypotheses. We were somehow advancing what would be WHAT IF ?. The results generated in the interviews and through the journal map we were able to do interviews every other day, for a week 15 minutes. the new normal allows us to be flexible in our research methods. In a next meeting and with the results obtained, we opted for some ideas for rapid implementation. INSIGHT & APPROACH The information collected from our clients was very valuable. 
Despite the quarantine, they continued to buy the products.\n\n• underlying diseases (reunatism, joint pain)\n\n• Being natural products, our clients felt that they were better than chemical ones.\n\n• Customers knew that the products we sold were fortified with VITAMINS A B C E D OMEGA 3.\n\n• in times of codvid, is necessary to reinforce the immune system with vitamins.\n\nThey valued that the delivery personnel have biosecurity measures in the deliveries of the product. The data we obtained led us to make the decision to make a change in communication on social networks, which is the channel through which more sales are made. We made multimedia material showing the staff making the deliveries with all the biosafety measures, we made more emphasis on the consumption of the product as a vitamin supplement, which in addition to doing good for basic diseases were reinforced with vitamins. Carrying out these two actions obtained from the journey maps and interviews with clients, they gave us enough feedback and insights to better analyze how to arrive objectively in the communication of the product. Although we have not yet had an exponential increase in sales, the level of inquiries about potential customers has risen, which indicates that we are on the right track. 
The non-traditional tools that design thinking gives us are applicable and flexible even for small businesses such as my wife.\n\n","[EXAMPLE, –, JOURNEY, MAP, \n\n, THE, CHALLENGE, , My, wife, owns, a, business, that, sells, health, and, beauty, products, ., As, a, result, of, the, codvid, pandemic, and, , the, restrictions, that, are, in, place, until, now, ,, sales, fell, ., delivery, personnel, had, to, be, fired, and, operations, , slowed, down, ., There, are, restrictions, for, the, sale, of, products, in, store, ., Although, deliveries, are, made, at, , home, and, at, no, additional, cost, ,, sales, do, not, go, up, ., The, option, to, reduce, the, price, of, the, product, was, not, , an, option, ,, there, had, ...]","[True, True, True, False, False, True, True, False, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, False, True, True, False, True, True, True, True, True, True, True, True, True, False, True, False, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, False, True, True, True, True, True, False, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, False, True, False, True, True, True, ...]","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...]"
6803,22679,"Why Mind Mapping?\n\nMind maps are graphical representations of information. In contrast to the traditional, linear notes you might make in a text document or even on paper, mind maps let you capture thoughts, ideas and keywords on a blank canvas. These ideas are organized in a two-dimensional structure, with the title/main idea always located in the center of the map for visibility. Related ideas branch off from the center in all directions, creating a radiant structure.\n\nDespite these key principles, the fact that mind mapping has existed for almost half a century makes it inevitable that some divergence will exist when it comes to defining what a mind map actually is.\n\nmind mapping is a particular technique that requires the following key elements to be effective:\n\n• A central image, to stimulate memory, associations and thought\n\nprocesses\n\n• Curvilinear branches, emanating from the central image, to depict\n\nthe basic ordering ideas (BOIs)\n\n• A (theoretically infinite) network of smaller branches to depict ideas\n\nstemming from the BOIs at different levels of detail\n\n• Conscious use of color to separate ideas by topic\n\n• A single keyword for each branch\n\nMany purists argue to this day that his method remains the one true technique: a map without the elements above cannot be considered a “true” mind map.\n\nWhen it comes to creating your mind map, the most important things to consider are what you need and how you learn best. Sometimes, you may not even need a mind map at all. While almost all mapping techniques were developed as an alternative to long-form text and linear notes, there are plenty of situations where linear note-taking is a perfectly suitable method.\n\nYour needs and goals should also be considered when you decide how to create the mind map itself. 
While traditional paper mind maps are great for developing ideas by yourself, the development of online mind mapping tools has enabled millions of people to brainstorm and plan together in real time. It’s for this reason that MindMeister, our very own online mind mapping tool, features such an intense focus on collaboration and group work.\n\n","[Why, Mind, Mapping, ?, \n\n, Mind, maps, are, graphical, representations, of, information, ., In, contrast, to, the, , traditional, ,, linear, notes, you, might, make, in, a, text, document, or, even, on, paper, ,, , mind, maps, let, you, capture, thoughts, ,, ideas, and, keywords, on, a, blank, canvas, ., , These, ideas, are, organized, in, a, two, -, dimensional, structure, ,, with, the, title, /, main, , idea, always, located, in, the, center, of, the, map, for, visibility, ., Related, ideas, branch, , off, from, the, center, in, all, directions, ,, creating, a, radiant, structure, ., \n\n, Despite, these, ...]","[True, True, False, False, False, True, True, True, True, True, True, False, True, True, True, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, False, True, True, True, True, True, False, True, True, True, True, True, True, True, False, True, False, True, True, True, True, True, True, False, False, True, False, True, True, True, False, False, True, False, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, False, True, True, True, True, True, True, False, True, True, True, True, False, False, False, True, True, ...]","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...]"
6804,22681,"Challenge\n\nSo, a few months back, I had chosen to intern with a start-up that was just three months old to understand the nitty-gritty of the initial phases of an organization set-up. We were given the challenge to formulate a Business plan. The challenges they faced were Brand positioning, market penetration, and customer segmentation of the target market. As the start-up was targeting a very niche market, there wasn’t much data available on the same. So, we faced a challenge wherein no other similar business was in-place to take as a reference. The outline of the project was to get candidates specifically from PSU (Government-run organization) to enroll for the mentorship program to pursue higher education\n\nSelection\n\nDue to a lack of data and a similar business model, we realized that Storytelling would be a crucial tool in conveying our project effectively. We felt that we need to ignite and unite our mentors to make our customers believe. We also felt that connecting with people and their emotions, as conveyed in the Storytelling tool video, would enable us to understand them better and our project as well.\n\nApplication\n\nSince there was no data available with us, we decided to perform some research and generate our data by interviewing some immediate stakeholders like current PSU employees and mentors since mentors have traversed through the same path. We found many crucial insights through this activity. We also got some hard facts based on which we started to build some recommendations. By the time we came up with some suggestions, we had realized we need to build suspense to develop their interest. Therefore, we started re-collecting the “Ahaa” moments that we experienced while solving problems related to various aspects of the B-plan discussed above. 
We began to design our presentation by making a concerted effort to get our audience in the frame of mind, which we required them to be.\n\nThrough these insights and planned efforts, we were able to deliver a presentation that connected with our mentors who had been through a similar path. Before addressing every problem (Brand positioning, market penetration, and customer segmentation of the target market), we built suspense before addressing that issue. Hence, our panelists were always interested and on the edge of their seats. Due to this, we were able to deliver an effective solution to our panelists and deliver it in a way that connected well with the mentors as well. We eventually won a pre-placement offer from the start-up.\n\nInsights\n\nAfter the presentation, we saw that some other submissions had similar recommendations, but the way we designed and conveyed our work, in the form of a story with spikes laid out in between, was admired and valued by the panelists and mentors. Hence, we realized that the effective communication of an idea plays a vital role in conveying an idea.\n\nApproach\n\nAfter completing the course, I realized that we could have used Mind-mapping in conjunction with Storytelling, which could have aligned our thoughts in a better way, and we could have come up with better insights through structured thinking.\n\n","[Challenge, \n\n, So, ,, a, few, months, back, ,, I, had, chosen, to, intern, with, a, start, -, up, that, was, just, three, months, old, , to, understand, the, nitty, -, gritty, of, the, initial, phases, of, an, organization, set, -, up, ., We, were, given, , the, challenge, to, formulate, a, Business, plan, ., The, challenges, they, faced, were, Brand, positioning, ,, , market, penetration, ,, and, customer, segmentation, of, the, target, market, ., As, the, start, -, up, was, , targeting, a, very, niche, market, ,, there, was, n’t, much, data, available, on, the, same, ., So, ,, ...]","[False, False, False, True, True, 
True, True, False, True, True, True, True, True, True, True, True, False, False, True, True, True, True, True, True, True, False, True, True, True, False, False, True, True, True, True, True, True, True, True, False, False, False, True, True, True, True, False, True, True, True, True, True, True, False, True, True, True, True, True, True, True, False, True, False, True, False, True, True, True, True, True, True, True, False, True, True, True, False, False, True, True, False, True, True, True, True, False, True, True, False, True, True, True, True, True, True, False, True, False, True, ...]","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...]"
6805,22684,"Brainstorming\n\nChallenge & Selection\n\nBrainstorming is a very powerful method for generating ideas that can solve a problem at the early stage of a design process. Its purpose is to invoke uninhibited thoughts and it can be practiced by an individual or a group of people. Group should meet in a comfortable environment equipped with stationery and feel free to throw any idea or thought, even those that might seem ridiculous. It is very important that group is diverse and consist of people from a wide range of backgrounds. Although this method focuses on quantity rather than quality, creativity of ideas is highly desirable. Most important advantages of this technique are that many ideas can be generated in a short time period, great deal of them can be put aside for later projects, and it doesn’t require lots of resources.\n\nWhen my team faced the challenge to create a new, motivating mobile application that will inspire users to set and accomplish their goals in life, this method proved to be an excellent method to set the foundations. Team itself was consist of members from different parts of the world and that also played pivotal role in the design process. All of my peers were familiar with applications that were trying to solve the same problem statement: People tend to set goals to improve their lives but very few manage to accomplish them. Since there were several similar applications on the market already, the biggest challenge was to determine what can we do to make our app better than the others. We found the answer to that question during the brainstorming session, along with the ideas for several more features that will motivate users and make our product stand out from the others.\n\nApplication\n\nDuring the brainstorming session team generated lots of creative ideas. Then we organized them into groups and subgroups in order to analyze all generated data. 
Having ideas organized like that, it was much easier to evaluate them, compare them with the current state of the art, predict their cost and feasibility, and finally, decide on which ones were the best. We came up with the key solution to our problem statement. What we needed to make our app successful was the instrument of encouragement. We also figured out how to motivate our users gradually, which was very important because it usually takes time to accomplish a goal.\n\nInsight & Approach\n\nBrainstorming is a technique that never failed me in my experience so far. In this particular case it had provided me and my team with lots of creative ideas which helped us form a set of features for a new mobile application that can stand fresh and original against what was currently on the market. Brainstorming method fosters collaboration, creativity, and productivity. To ensure a successful outcome it is good to have an experienced moderator to facilitate the session. Experienced moderator will know how to encourage participants to come up with more ideas and how to build up on them. He will also keep in mind that it is very important to have a well defined problem statement, and make sure that focus stays on users that the group is trying to solve the problem for. That is something that we, as we were unexperienced designers at that time, didn’t know. 
Although we had defined the problem statement and kept focus on it, we could have completed a successful brainstorming session in much less time if we had had an experienced facilitator to lead our group.\n\n","[Brainstorming, \n\n, Challenge, &, Selection, \n\n, Brainstorming, is, a, very, powerful, method, for, generating, ideas, that, can, solve, a, problem, at, the, , early, stage, of, a, design, process, ., Its, purpose, is, to, invoke, uninhibited, thoughts, and, it, can, be, , practiced, by, an, individual, or, a, group, of, people, ., Group, should, meet, in, a, comfortable, , environment, equipped, with, stationery, and, feel, free, to, throw, any, idea, or, thought, ,, even, those, , that, might, seem, ridiculous, ., It, is, very, important, that, group, is, diverse, and, consist, of, people, from, , a, wide, range, of, backgrounds, ...]","[False, False, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, False, ...]","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...]"


In [45]:
pd.set_option('display.max_colwidth', None)

In [46]:
df[['tokens', 'labels']].iloc[0]

tokens    [Design, Thinking, for, innovation, reflexion, -, Avril, 2021, -, Nathalie, Sylla, \n\n, Challenge, &, selection, \n\n, The, tool, I, use, to, help, all, stakeholders, finding, their, way, through, the, complexity, of, a, project, is, the,  , mind, map, ., \n\n, What, exactly, is, a, mind, map, ?, According, to, the, definition, of, Buzan, T., and, Buzan, B., (, 1999, ,, Dessine, -, moi,  , l'intelligence, ., Paris, :, Les, Éditions, d'Organisation, ., ), ,, the, mind, map, (, or, heuristic, diagram, ), is, a, graphic,  , representation, technique, that, follows, the, natural, functioning, of, the, mind, and, allows, the, brain, ...]
labels                                                                                                                                                                                                                                                                                                                          [O, O, O, O, O, O, O, O, O, 

It's important to carefully read the [dataset description](https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/data) to understand how each of these columns is used.

One of the most useful features of `DataFrame` is the `describe()` method:

In [47]:
df.describe(include='object')

Unnamed: 0,full_text,tokens,trailing_whitespace,labels
count,6807,6807,6807,6807
unique,6807,6807,6806,2221
top,"Design Thinking for innovation reflexion-Avril 2021-Nathalie Sylla\n\nChallenge & selection\n\nThe tool I use to help all stakeholders finding their way through the complexity of a project is the mind map.\n\nWhat exactly is a mind map? According to the definition of Buzan T. and Buzan B. (1999, Dessine-moi l'intelligence. Paris: Les Éditions d'Organisation.), the mind map (or heuristic diagram) is a graphic representation technique that follows the natural functioning of the mind and allows the brain's potential to be released. Cf Annex1\n\nThis tool has many advantages:\n\n• It is accessible to all and does not require significant material investment and can be done quickly\n\n• It is scalable\n\n• It allows categorization and linking of information\n\n• It can be applied to any type of situation: notetaking, problem solving, analysis, creation of new ideas\n\n• It is suitable for all people and is easy to learn\n\n• It is fun and encourages exchanges\n\n• It makes visible the dimension of projects, opportunities, interconnections\n\n• It synthesizes\n\n• It makes the project understandable\n\n• It allows you to explore ideas\n\nThe creation of a mind map starts with an idea/problem located at its center. This starting point generates ideas/work areas, incremented around this center in a radial structure, which in turn is completed with as many branches as new ideas.\n\nThis tool enables creativity and logic to be mobilized, it is a map of the thoughts.\n\nCreativity is enhanced because participants feel comfortable with the method.\n\nApplication & Insight\n\nI start the process of the mind map creation with the stakeholders standing around a large board (white or paper board). In the center of the board, I write and highlight the topic to design.\n\nThrough a series of questions, I guide the stakeholders in modelling the mind map. I adapt the series of questions according to the topic to be addressed. 
In the type of questions, we can use: who, what, when, where, why, how, how much.\n\nThe use of the “why” is very interesting to understand the origin. By this way, the interviewed person frees itself from paradigms and thus dares to propose new ideas / ways of functioning. I plan two hours for a workshop.\n\nDesign Thinking for innovation reflexion-Avril 2021-Nathalie Sylla\n\nAfter modelling the mind map on paper, I propose to the participants a digital visualization of their work with the addition of color codes, images and interconnections. This second workshop also lasts two hours and allows the mind map to evolve. Once familiarized with it, the stakeholders discover the power of the tool. Then, the second workshop brings out even more ideas and constructive exchanges between the stakeholders. Around this new mind map, they have learned to work together and want to make visible the untold ideas.\n\nI now present all the projects I manage in this type of format in order to ease rapid understanding for decision-makers. These presentations are the core of my business models. The decision-makers are thus able to identify the opportunities of the projects and can take quick decisions to validate them. They find answers to their questions thank to a schematic representation.\n\nApproach\n\nWhat I find amazing with the facilitation of this type of workshop is the participants commitment for the project. This tool helps to give meaning. The participants appropriate the story and want to keep writing it. Then, they easily become actors or sponsors of the project. 
A trust relationship is built, thus facilitating the implementation of related actions.\n\nDesign Thinking for innovation reflexion-Avril 2021-Nathalie Sylla\n\nAnnex 1: Mind Map Shared facilities project\n\n","[Design, Thinking, for, innovation, reflexion, -, Avril, 2021, -, Nathalie, Sylla, \n\n, Challenge, &, selection, \n\n, The, tool, I, use, to, help, all, stakeholders, finding, their, way, through, the, complexity, of, a, project, is, the, , mind, map, ., \n\n, What, exactly, is, a, mind, map, ?, According, to, the, definition, of, Buzan, T., and, Buzan, B., (, 1999, ,, Dessine, -, moi, , l'intelligence, ., Paris, :, Les, Éditions, d'Organisation, ., ), ,, the, mind, map, (, or, heuristic, diagram, ), is, a, graphic, , representation, technique, that, follows, the, natural, functioning, of, the, mind, and, allows, the, brain, ...]","[True, True, True, False, False, True, True, True, False, True, True, True, True, True, True, True, True, True, False, True, True, True, False, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, False, True, True, True, True, True, True, True, True, False, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, False, True, True, True, True, True, False, True, False, True, True, False, False, False, True, ...]","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...]"
freq,1,1,2,20


We can see that in the 36473 rows, there are 733 unique anchors, 106 contexts, and nearly 30000 targets. Some anchors are very common, with "component composite coating" for instance appearing 152 times.

Earlier, I suggested we could represent the input to the model as something like "*TEXT1: abatement; TEXT2: eliminating process*". We'll need to add the context to this too. In Pandas, we just use `+` to concatenate, like so:

In [None]:
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor

We can refer to a column (also known as a *series*) either using regular Python "dotted" notation (`df.input`), or by accessing it like a dictionary (`df['input']`). To get the first few rows, use `head()`:

In [None]:
df.input.head()

## Tokenization

Transformers uses a `Dataset` object for storing a... well a dataset, of course! We can create one like so:

In [48]:
from datasets import Dataset,DatasetDict

ds = Dataset.from_pandas(df)

Here's how it's displayed in a notebook:

In [49]:
ds

Dataset({
    features: ['document', 'full_text', 'tokens', 'trailing_whitespace', 'labels'],
    num_rows: 6807
})

But we can't pass the texts directly into a model. A deep learning model expects numbers as inputs, not English sentences! So we need to do two things:

- *Tokenization*: Split each text up into words (or actually, as we'll see, into *tokens*)
- *Numericalization*: Convert each word (or token) into a number.
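
As a rough illustration of these two steps, here is a toy sketch that splits on whitespace and builds its own vocabulary. Real tokenizers split into sub-word pieces and ship with a fixed vocabulary; `toy_tokenize`, `build_vocab` and `numericalize` are hypothetical names used only for this example.

```python
def toy_tokenize(text):
    """Tokenization: split a text into lowercase whitespace-separated tokens."""
    return text.lower().split()

def build_vocab(texts):
    """Assign each distinct token an integer ID, in order of first appearance."""
    vocab = {}
    for text in texts:
        for tok in toy_tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(text, vocab):
    """Numericalization: replace each token with its integer ID."""
    return [vocab[tok] for tok in toy_tokenize(text)]

texts = ["the mind map is a tool", "a tool is a tool"]
vocab = build_vocab(texts)
ids = numericalize("a tool is the map", vocab)
print(vocab)
print(ids)
```

A text containing a word that never appeared in `texts` would raise a `KeyError` here, which is one reason real tokenizers fall back to sub-word pieces.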

The details about how this is done actually depend on the particular model we use. So first we'll need to pick a model. There are thousands of models available, but a reasonable starting point for nearly any NLP problem is to use this (replace "small" with "large" for a slower but more accurate model, once you've finished exploring):

In [50]:
model_nm = 'microsoft/deberta-v3-small'

`AutoTokenizer` will create a tokenizer appropriate for a given model:

In [51]:
import spacy

In [52]:
from transformers import AutoModelForSequenceClassification,AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)



Here's an example of how the tokenizer splits a text into "tokens" (which are like words, but can be sub-word pieces, as you see below):

In [62]:
tokz.tokenize("G'day folks, I'm Jeremy from fast.ai!")

['▁G',
 "'",
 'day',
 '▁folks',
 ',',
 '▁I',
 "'",
 'm',
 '▁Jeremy',
 '▁from',
 '▁fast',
 '.',
 'ai',
 '!']

Uncommon words will be split into pieces. The start of a new word is represented by `▁`:

In [54]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("G'day folks, I'm Jeremy from fast.ai!")
for token in doc:
  print(token.text)

G'day
folks
,
I
'm
Jeremy
from
fast.ai
!


In [90]:
tokz.tokenize('Nathalie Sylla')

['▁Nathalie', '▁S', 'ylla']

In [63]:
spac = [token.text for token in doc]
spac

["G'day", 'folks', ',', 'I', "'m", 'Jeremy', 'from', 'fast.ai', '!']

In [81]:
tokz.tokenize(row['full_text'])

['▁Design',
 '▁Thinking',
 '▁for',
 '▁innovation',
 '▁reflex',
 'ion',
 '-',
 'Av',
 'ril',
 '▁2021',
 '-',
 'N',
 'atha',
 'lie',
 '▁S',
 'ylla',
 '▁Challenge',
 '▁&',
 '▁selection',
 '▁The',
 '▁tool',
 '▁I',
 '▁use',
 '▁to',
 '▁help',
 '▁all',
 '▁stakeholders',
 '▁finding',
 '▁their',
 '▁way',
 '▁through',
 '▁the',
 '▁complexity',
 '▁of',
 '▁a',
 '▁project',
 '▁is',
 '▁the',
 '▁mind',
 '▁map',
 '.',
 '▁What',
 '▁exactly',
 '▁is',
 '▁a',
 '▁mind',
 '▁map',
 '?',
 '▁According',
 '▁to',
 '▁the',
 '▁definition',
 '▁of',
 '▁Buz',
 'an',
 '▁T',
 '.',
 '▁and',
 '▁Buz',
 'an',
 '▁B',
 '.',
 '▁(',
 '1999',
 ',',
 '▁Des',
 's',
 'ine',
 '-',
 'moi',
 '▁l',
 "'",
 'intelligence',
 '.',
 '▁Paris',
 ':',
 '▁Les',
 '▁É',
 'dition',
 's',
 '▁d',
 "'",
 'Organ',
 'isation',
 '.',
 ')',
 ',',
 '▁the',
 '▁mind',
 '▁map',
 '▁(',
 'or',
 '▁heuristic',
 '▁diagram',
 ')',
 '▁is',
 '▁a',
 '▁graphic',
 '▁representation',
 '▁technique',
 '▁that',
 '▁follows',
 '▁the',
 '▁natural',
 '▁functioning',
 '▁of',
 '

In [80]:
toktok = [tok for token in nlp(row['full_text']) for tok in tokz.tokenize(token.text)]
toktok

['▁Design',
 '▁Thinking',
 '▁for',
 '▁innovation',
 '▁reflex',
 'ion',
 '▁-',
 '▁Avril',
 '▁2021',
 '▁-',
 '▁Nathalie',
 '▁S',
 'ylla',
 '▁Challenge',
 '▁&',
 '▁selection',
 '▁The',
 '▁tool',
 '▁I',
 '▁use',
 '▁to',
 '▁help',
 '▁all',
 '▁stakeholders',
 '▁finding',
 '▁their',
 '▁way',
 '▁through',
 '▁the',
 '▁complexity',
 '▁of',
 '▁a',
 '▁project',
 '▁is',
 '▁the',
 '▁mind',
 '▁map',
 '▁.',
 '▁What',
 '▁exactly',
 '▁is',
 '▁a',
 '▁mind',
 '▁map',
 '▁?',
 '▁According',
 '▁to',
 '▁the',
 '▁definition',
 '▁of',
 '▁Buz',
 'an',
 '▁T',
 '.',
 '▁and',
 '▁Buz',
 'an',
 '▁B',
 '.',
 '▁(',
 '▁1999',
 '▁,',
 '▁Des',
 's',
 'ine',
 '▁-',
 '▁moi',
 '▁l',
 "'",
 'intelligence',
 '▁.',
 '▁Paris',
 '▁:',
 '▁Les',
 '▁É',
 'dition',
 's',
 '▁d',
 "'",
 'Organ',
 'isation',
 '▁.',
 '▁)',
 '▁,',
 '▁the',
 '▁mind',
 '▁map',
 '▁(',
 '▁or',
 '▁heuristic',
 '▁diagram',
 '▁)',
 '▁is',
 '▁a',
 '▁graphic',
 '▁representation',
 '▁technique',
 '▁that',
 '▁follows',
 '▁the',
 '▁natural',
 '▁functioning',
 '▁of',


In [None]:
tokz.tokenize("A platypus is an ornithorhynchus anatinus.")

In [None]:
row=ds[0]
len(tokz.tokenize(row['full_text'])), len(row['tokens'])

(724, 753)


In [None]:
tokz.tokenize(row['full_text']), row['tokens']

(['▁Design',
  '▁Thinking',
  '▁for',
  '▁innovation',
  '▁reflex',
  'ion',
  '-',
  'Av',
  'ril',
  '▁2021',
  '-',
  'N',
  'atha',
  'lie',
  '▁S',
  'ylla',
  '▁Challenge',
  '▁&',
  '▁selection',
  '▁The',
  '▁tool',
  '▁I',
  '▁use',
  '▁to',
  '▁help',
  '▁all',
  '▁stakeholders',
  '▁finding',
  '▁their',
  '▁way',
  '▁through',
  '▁the',
  '▁complexity',
  '▁of',
  '▁a',
  '▁project',
  '▁is',
  '▁the',
  '▁mind',
  '▁map',
  '.',
  '▁What',
  '▁exactly',
  '▁is',
  '▁a',
  '▁mind',
  '▁map',
  '?',
  '▁According',
  '▁to',
  '▁the',
  '▁definition',
  '▁of',
  '▁Buz',
  'an',
  '▁T',
  '.',
  '▁and',
  '▁Buz',
  'an',
  '▁B',
  '.',
  '▁(',
  '1999',
  ',',
  '▁Des',
  's',
  'ine',
  '-',
  'moi',
  '▁l',
  "'",
  'intelligence',
  '.',
  '▁Paris',
  ':',
  '▁Les',
  '▁É',
  'dition',
  's',
  '▁d',
  "'",
  'Organ',
  'isation',
  '.',
  ')',
  ',',
  '▁the',
  '▁mind',
  '▁map',
  '▁(',
  'or',
  '▁heuristic',
  '▁diagram',
  ')',
  '▁is',
  '▁a',
  '▁graphic',
  '▁repre

Here's a simple function which tokenizes our inputs:

In [None]:
def tok_func(x): return tokz(x["input"])

To run this quickly in parallel on every row in our dataset, use `map`:

In [None]:
tok_ds = ds.map(tok_func, batched=True)

This adds a new item to our dataset called `input_ids`. For instance, here is the input and IDs for the first row of our data:

In [None]:
row = tok_ds[0]
row['input'], row['input_ids']

So, what are those IDs and where do they come from? The secret is that there's a dictionary called `vocab` in the tokenizer which maps every possible token string to a unique integer. We can look them up like this, for instance to find the ID of the token for the word "of":

In [None]:
tokz.vocab['▁of']

Looking above at our input IDs, we do indeed see that `265` appears as expected.
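
The lookup also works in reverse. The sketch below uses a tiny made-up vocabulary (the IDs are invented for illustration and are not the real tokenizer's IDs) to turn IDs back into token strings and rejoin them, treating `▁` as the start-of-word marker described earlier. The `decode` helper is hypothetical, not a Transformers function.

```python
# Hypothetical mini-vocabulary; the real tokenizer's IDs will differ.
toy_vocab = {'▁Nathalie': 0, '▁S': 1, 'ylla': 2, '▁of': 3}

# Invert the mapping so we can go from ID back to token string
id_to_token = {i: tok for tok, i in toy_vocab.items()}

def decode(ids):
    """Map IDs to token strings, then turn each '▁' into a space."""
    tokens = [id_to_token[i] for i in ids]
    return ''.join(tokens).replace('▁', ' ').strip()

print(decode([0, 1, 2]))
```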

Finally, we need to prepare our labels. Transformers always assumes that your labels have the column name `labels`, but in our dataset it's currently `score`. Therefore, we need to rename it:

In [None]:
tok_ds = tok_ds.rename_columns({'score':'labels'})

Now that we've prepared our tokens and labels, we need to create our validation set.

## Test and validation sets

You may have noticed that our directory contained another file:

In [None]:
eval_df = pd.read_csv(path/'test.csv')
eval_df.describe()

This is the *test set*. Possibly the most important idea in machine learning is that of having separate training, validation, and test data sets.

### Validation set

To explain the motivation, let's start simple, and imagine we're trying to fit a model where the true relationship is this quadratic:

In [None]:
def f(x): return -3*x**2 + 2*x + 20

Unfortunately matplotlib (the most common library for plotting in Python) doesn't come with a way to visualize a function, so we'll write something to do this ourselves:

In [None]:
import numpy as np, matplotlib.pyplot as plt

def plot_function(f, min=-2.1, max=2.1, color='r'):
    x = np.linspace(min,max, 100)[:,None]
    plt.plot(x, f(x), color)

Here's what our function looks like:

In [None]:
plot_function(f)

For instance, perhaps we've measured the height above ground of an object before and after some event. The measurements will have some random error. We can use numpy's random number generator to simulate that. I like to use `seed` when writing about simulations like this so that I know you'll see the same thing I do:

In [None]:
from numpy.random import normal,seed,uniform
np.random.seed(42)

Here's a function `add_noise` that adds some random variation to an array:

In [None]:
def noise(x, scale): return normal(scale=scale, size=x.shape)
def add_noise(x, mult, add): return x * (1+noise(x,mult)) + noise(x,add)

Let's use it to simulate some measurements evenly distributed over time:

In [None]:
x = np.linspace(-2, 2, num=20)[:,None]
y = add_noise(f(x), 0.2, 1.3)
plt.scatter(x,y);

Now let's see what happens if we *underfit* or *overfit* these predictions. To do that, we'll create a function that fits a polynomial of some degree (e.g. a line is degree 1, quadratic is degree 2, cubic is degree 3, etc). The details of how this function works don't matter too much so feel free to skip over it if you like!  (PS: if you're not sure about the jargon around polynomials, here's a [great video](https://www.youtube.com/watch?v=ffLLmV4mZwU) which teaches you what you'll need to know.)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

def plot_poly(degree):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    plt.scatter(x,y)
    plot_function(model.predict)

So, what happens if we fit a line (a "degree 1 polynomial") to our measurements?

In [None]:
plot_poly(1)

As you see, the points on the red line (the line we fitted) aren't very close at all. This is *under-fit* -- there's not enough detail in our function to match our data.

And what happens if we fit a degree 10 polynomial to our measurements?

In [None]:
plot_poly(10)

Well now it fits our data better, but it doesn't look like it'll do a great job predicting points other than those we measured -- especially those in earlier or later time periods. This is *over-fit* -- there's too much detail such that the model fits our points, but not the underlying process we really care about.

Let's try a degree 2 polynomial (a quadratic), and compare it to our "true" function (in blue):

In [None]:
plot_poly(2)
plot_function(f, color='b')

That's not bad at all!

So, how do we recognise whether our models are under-fit, over-fit, or "just right"? We use a *validation set*. This is a set of data that we "hold out" from training -- we don't let our model see it at all. If you use the fastai library, it automatically creates a validation set for you if you don't have one, and will always report metrics (measurements of the accuracy of a model) using the validation set.

The validation set is *only* ever used to see how we're doing. It's *never* used as inputs to training the model.
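
To make this concrete, here is a minimal sketch that holds out a quarter of some simulated measurements and compares the error on the training points with the error on the held-out points. It assumes simple additive noise rather than the `add_noise` function above, and it swaps in `numpy.polynomial.Polynomial.fit` for the scikit-learn pipeline to keep the example short.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def f(x): return -3*x**2 + 2*x + 20

# Simulated noisy measurements of the quadratic
x = np.linspace(-2, 2, 40)
y = f(x) + rng.normal(scale=1.5, size=x.shape)

# Hold out a random quarter of the points as the validation set
idx = rng.permutation(len(x))
n_valid = len(x) // 4
valid_idx, train_idx = idx[:n_valid], idx[n_valid:]

def mse(model, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((model(xs) - ys) ** 2))

results = {}
for degree in (1, 2, 10):
    # Fit on the training points only; the validation points stay unseen
    model = Polynomial.fit(x[train_idx], y[train_idx], degree)
    results[degree] = (mse(model, x[train_idx], y[train_idx]),
                       mse(model, x[valid_idx], y[valid_idx]))
    print(degree, results[degree])
```

The training error can only go down as the degree rises, so on its own it can't tell us when we've started to over-fit. The validation error is the number to compare across degrees.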

Transformers uses a `DatasetDict` for holding your training and validation sets. To create one that contains 25% of our data for the validation set, and 75% for the training set, use `train_test_split`:

In [None]:
dds = tok_ds.train_test_split(0.25, seed=42)
dds

As you see above, the validation set here is called `test` and not `validate`, so be careful!

In practice, a random split like we've used here might not be a good idea -- here's what Dr Rachel Thomas has to say about it:

> "*One of the most likely culprits for this disconnect between results in development vs results in production is a poorly chosen validation set (or even worse, no validation set at all). Depending on the nature of your data, choosing a validation set can be the most important step. Although sklearn offers a `train_test_split` method, this method takes a random subset of the data, which is a poor choice for many real-world problems.*"

I strongly recommend reading her article [How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/) to more fully understand this critical topic.
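
One alternative the article discusses for time-ordered data is to hold out the most recent records rather than a random sample. A minimal sketch, assuming each record carries a sortable `date` field (the records and the `time_split` helper are made up for illustration):

```python
def time_split(records, valid_frac=0.25):
    """Sort by date and hold out the most recent fraction for validation."""
    ordered = sorted(records, key=lambda r: r['date'])
    split = len(ordered) - int(len(ordered) * valid_frac)
    return ordered[:split], ordered[split:]

# Made-up monthly records, deliberately out of order
records = [{'date': f'2021-{m:02d}-01', 'value': m} for m in (5, 1, 8, 3, 2, 7, 4, 6)]
train, valid = time_split(records)
print([r['date'] for r in valid])
```

Every validation record is later than every training record, which mimics using a model on data that arrives after it was trained.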

### Test set

So that's the validation set explained, and created. What about the "test set" then -- what's that for?

The *test set* is yet another dataset that's held out from training. But it's held out from reporting metrics too! The accuracy of your model on the test set is only ever checked after you've completed your entire training process, including trying different models, training methods, data processing, etc.

You see, as you try all these different things, to see their impact on the metrics on the validation set, you might just accidentally find a few things that entirely coincidentally improve your validation set metrics, but aren't really better in practice. Given enough time and experiments, you'll find lots of these coincidental improvements. That means you're actually over-fitting to your validation set!
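
Here is a small simulation of that effect, under the deliberately extreme assumption that every "experiment" is a model that guesses at random and so has no real skill. All names and sizes are invented for the illustration.

```python
import random

random.seed(0)
n_valid, n_test, n_experiments = 50, 50, 200

# True labels for a validation set and a test set (coin flips)
valid_labels = [random.randint(0, 1) for _ in range(n_valid)]
test_labels = [random.randint(0, 1) for _ in range(n_test)]

def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Each experiment guesses at random on both sets
experiments = []
for _ in range(n_experiments):
    valid_preds = [random.randint(0, 1) for _ in range(n_valid)]
    test_preds = [random.randint(0, 1) for _ in range(n_test)]
    experiments.append((accuracy(valid_preds, valid_labels),
                        accuracy(test_preds, test_labels)))

# Pick the experiment that looks best on the validation set
best_valid, its_test = max(experiments, key=lambda e: e[0])
mean_valid = sum(v for v, _ in experiments) / n_experiments
print(f'mean validation accuracy: {mean_valid:.2f}')
print(f'best validation accuracy: {best_valid:.2f}, same model on test: {its_test:.2f}')
```

With this many skill-free experiments, the winner's validation accuracy will usually sit well above the average, while its accuracy on the untouched test labels has no reason to follow.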

That's why we keep a test set held back. Kaggle's public leaderboard is like a test set that you can check from time to time. But don't check too often, or you'll even be over-fitting to the test set!

Kaggle has a *second* test set, which is yet another held-out dataset that's only used at the *end* of the competition to assess your predictions. That's called the "private leaderboard". Here's a [great post](https://gregpark.io/blog/Kaggle-Psychopathy-Postmortem/) about what can happen if you overfit to the public leaderboard.

We'll use `eval` as our name for the test set, to avoid confusion with the `test` dataset that was created above.

In [None]:
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)

## Metrics and correlation

When we're training a model, there will be one or more *metrics* that we're interested in maximising or minimising. These are the measurements that should, hopefully, represent how well our model will work for us.

In real life, outside of Kaggle, things are not so easy... As my partner Dr Rachel Thomas notes in [The problem with metrics is a big problem for AI](https://www.fast.ai/2019/09/24/metrics/):

>  At their heart, what most current AI approaches do is to optimize metrics. The practice of optimizing metrics is not new nor unique to AI, yet AI can be particularly efficient (even too efficient!) at doing so. This is important to understand, because any risks of optimizing metrics are heightened by AI. While metrics can be useful in their proper place, there are harms when they are unthinkingly applied. Some of the scariest instances of algorithms run amok all result from over-emphasizing metrics. We have to understand this dynamic in order to understand the urgent risks we are facing due to misuse of AI.

In Kaggle, however, it's very straightforward to know what metric to use: Kaggle will tell you! According to this competition's [evaluation page](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/overview/evaluation), "*submissions are evaluated on the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between the predicted and actual similarity scores*." This coefficient is usually abbreviated using the single letter *r*. It is the most widely used measure of the degree of relationship between two variables.

r can vary between `-1`, which means perfect inverse correlation, and `+1`, which means perfect positive correlation. The mathematical formula for it is much less important than getting a good intuition for what the different values look like. To start to get that intuition, let's look at some examples using the [California Housing](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset) dataset, whose target "*is the median house value for California districts, expressed in hundreds of thousands of dollars*". This dataset is provided by the excellent [scikit-learn](https://scikit-learn.org/stable/) library, which is the most widely used library for machine learning outside of deep learning.

In [None]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
housing = housing['data'].join(housing['target']).sample(1000, random_state=52)
housing.head()

We can see all the correlation coefficients for every combination of columns in this dataset by calling `np.corrcoef`:

In [None]:
np.set_printoptions(precision=2, suppress=True)

np.corrcoef(housing, rowvar=False)

This works well when we're getting a bunch of values at once, but it's overkill when we want a single coefficient:

In [None]:
np.corrcoef(housing.MedInc, housing.MedHouseVal)

Therefore, we'll create this little function to just return the single number we need given a pair of variables:

In [None]:
def corr(x,y): return np.corrcoef(x,y)[0][1]

corr(housing.MedInc, housing.MedHouseVal)
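
For readers who do want to see the arithmetic, r is the covariance of the two variables divided by the product of their standard deviations. Here is a small pure-Python sketch of that calculation; the helper name `pearson_r` is my own and not part of NumPy.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # The 1/n factors cancel between numerator and denominator, so we skip them
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))   # perfectly linear, increasing
print(pearson_r(xs, [10, 8, 6, 4, 2]))   # perfectly linear, decreasing
```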

Now we'll look at a few examples of correlations, using this function (the details of the function don't matter too much):

In [None]:
def show_corr(df, a, b):
    x,y = df[a],df[b]
    plt.scatter(x,y, alpha=0.5, s=4)
    plt.title(f'{a} vs {b}; r: {corr(x, y):.2f}')

OK, let's check out the correlation between income and house value:

In [None]:
show_corr(housing, 'MedInc', 'MedHouseVal')

So that's what a correlation of 0.68 looks like. It's quite a close relationship, but there's still a lot of variation. (Incidentally, this also shows why looking at your data is so important -- we can see clearly in this plot that house prices above $500,000 seem to have been truncated to that maximum value).

Let's take a look at another pair:

In [None]:
show_corr(housing, 'MedInc', 'AveRooms')

The relationship looks like it is similarly close to the previous example, but r is much lower than the income vs valuation case. Why is that? The reason is that there are a lot of *outliers* -- values of `AveRooms` well outside the mean.

r is very sensitive to outliers. If there are outliers in your data, then the relationship between them will dominate the metric. In this case, the districts with a very high average number of rooms don't tend to have correspondingly high incomes, so they pull r down from where it would otherwise be.
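
To see how strongly a single point can move r, here is a sketch on made-up data: twenty points that lie exactly on a line, then the same points plus one extreme value that breaks the pattern.

```python
import numpy as np

x = np.arange(20, dtype=float)
y = 2 * x + 1                      # perfectly linear, so r is 1

# Append one extreme point that breaks the pattern
x_out = np.append(x, 100.0)
y_out = np.append(y, 0.0)

r_clean = np.corrcoef(x, y)[0, 1]
r_outlier = np.corrcoef(x_out, y_out)[0, 1]
print(f'{r_clean:.2f} -> {r_outlier:.2f}')
```

One point out of twenty-one is enough to drag r a long way from 1, even though the other twenty still line up perfectly.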

Let's remove the outliers and try again:

In [None]:
subset = housing[housing.AveRooms<15]
show_corr(subset, 'MedInc', 'AveRooms')

As we expected, now the correlation is very similar to our first comparison.

Here's another relationship using `AveRooms` on the subset:

In [None]:
show_corr(subset, 'MedHouseVal', 'AveRooms')

At this level, with r of 0.34, the relationship is becoming quite weak.

Let's look at one more:

In [None]:
show_corr(subset, 'HouseAge', 'AveRooms')

As you see here, a correlation of -0.2 shows a very weak negative trend.

We've now seen examples of a variety of levels of correlation coefficient, so hopefully you're getting a good sense of what this metric means.

Transformers expects metrics to be returned as a `dict`, since that way the trainer knows what label to use, so let's create a function to do that:

In [None]:
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}
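
As a quick sanity check outside the trainer, a metric function of this shape can be called by hand with a made-up pair of predictions and labels. This sketch assumes the trainer passes a `(predictions, labels)` pair, which is what the unpacking in `corr_d` relies on. The `pearson_metric` name and the flattening step are my own additions, as a precaution in case predictions arrive with shape `(n, 1)`.

```python
import numpy as np

def pearson_metric(eval_pred):
    """Return the metric as a dict so its name travels with its value."""
    preds, labels = eval_pred
    # Flatten both arrays in case predictions arrive with shape (n, 1)
    preds, labels = np.ravel(preds), np.ravel(labels)
    return {'pearson': float(np.corrcoef(preds, labels)[0, 1])}

fake_preds = np.array([[0.1], [0.4], [0.5], [0.9]])   # made-up model outputs
fake_labels = np.array([0.0, 0.5, 0.5, 1.0])          # made-up targets
result = pearson_metric((fake_preds, fake_labels))
print(result)
```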

## Training

### Training our model

To train a model in Transformers we'll need this:

In [None]:
from transformers import TrainingArguments,Trainer

We pick a batch size that fits our GPU, and a small number of epochs so we can run experiments quickly:

In [None]:
bs = 128
epochs = 4

The most important hyperparameter is the learning rate. fastai provides a learning rate finder to help you figure this out, but Transformers doesn't, so you'll just have to use trial and error. The idea is to find the largest value you can that doesn't cause training to fail.

In [None]:
lr = 8e-5
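
To get a feel for what "training failing" means, here is a toy sketch of the same trial and error on a one-parameter problem: minimising `w**2` with plain gradient descent. It is not the real training loop, and the learning rates only make sense for this toy function.

```python
def train(lr, steps=50, w=5.0):
    """Minimise w**2 by gradient descent and return the final loss."""
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w = w - lr * grad
    return w ** 2

# Try a range of learning rates, from cautious to too large
for lr in (0.01, 0.1, 0.9, 1.1):
    print(f'lr={lr}: final loss {train(lr):.3g}')
```

The smallest rate is still far from the minimum after 50 steps, the middle rates get there, and the largest one makes the loss blow up instead of shrink.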

Transformers uses the `TrainingArguments` class to set up arguments. Don't worry too much about the values we're using here -- they should generally work fine in most cases. It's just the 3 parameters above that you may need to change for different models.

In [None]:
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
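
For the curious, `warmup_ratio=0.1` together with `lr_scheduler_type='cosine'` means the learning rate ramps up over roughly the first 10% of training and then decays along a cosine curve. The function below is a simplified sketch of that shape, not the library's exact implementation; `lr_at` and the step counts are invented for the illustration.

```python
import math

def lr_at(step, total_steps, peak_lr=8e-5, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

total = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, f'{lr_at(step, total):.2e}')
```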

We can now create our model, and `Trainer`, which is a class that combines the data and model together (just like `Learner` in fastai):

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)

As you see, Transformers spits out lots of warnings. You can safely ignore them.

Let's train our model!

In [None]:
trainer.train();

Lots more warnings from Transformers again -- you can ignore these as before.

The key thing to look at is the "Pearson" value in the table above. As you see, it's increasing, and is already above 0.8. That's great news! We can now submit our predictions to Kaggle if we want them to be scored on the official leaderboard. Let's get some predictions on the test set:

In [None]:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds

Look out - some of our predictions are <0, or >1! This once again shows the value of remembering to actually *look* at your data. Let's fix those out-of-bounds predictions:

In [None]:
preds = np.clip(preds, 0, 1)

In [None]:
preds

OK, now we're ready to create our submission file. If you save a CSV in your notebook, you will get the option to submit it later.

In [None]:
import datasets

submission = datasets.Dataset.from_dict({
    'id': eval_ds['id'],
    'score': preds
})

submission.to_csv('submission.csv', index=False)

Unfortunately this is a *code competition* and internet access is disabled. That means the `pip install datasets` command we used above won't work if you want to submit to Kaggle. To fix this, you'll need to download the pip installers to Kaggle first, as [described here](https://www.kaggle.com/c/severstal-steel-defect-detection/discussion/113195). Once you've done that, disable internet in your notebook, go to the Kaggle leaderboards page, and click the *Submission* button.

## The end

Once you're ready to go deeper, take a look at my [Iterate Like a Grandmaster](https://www.kaggle.com/code/jhoward/iterate-like-a-grandmaster/) notebook.

Thanks for reading! This has been a bit of an experiment for me -- I've never done an "absolute beginners" guide before on Kaggle. I hope you like it! If you do, I'd greatly appreciate an upvote. Don't hesitate to add a comment if you have any questions or thoughts to add.