# Gathering, Profiling, and Cleaning Data

![ml workflow](https://docs.google.com/drawings/d/e/2PACX-1vQ-v8AikdWJxzh5WdNTi9dhv-J6YF4DbbFJ9YQbAKbnljVV0MozzUX5TGhJ1NhtRcJrKdu_sh2QC_hy/pub?w=1165&h=662)

Let's dive in and see how we can complete the various parts of this ML workflow with Go. In particular, let's look at how we can import, parse, manipulate, and profile data with Go. Note, there are innumerable types and formats of data that you might have to deal with in an ML/AI workflow (CSV, JSON, Parquet, Avro, etc.), and we won't cover all of them. Rather, we will highlight a few of the main Go packages that you can utilize for data gathering, profiling, and cleaning.

We will look at two example data sets in this notebook:
- an [emoji data set](https://www.kaggle.com/sanjayaw/emosim508) in JSON format, and
- a [Game of Thrones data set](https://github.com/chrisalbon/war_of_the_five_kings_dataset) in CSV format.

## Import Libraries

In [None]:
import (
    "os"
    "fmt"
    "encoding/csv"
    "encoding/json"
    "io/ioutil"
    "strings"
    "strconv"
    
    "gonum.org/v1/plot"
    "gonum.org/v1/plot/plotter"
    "gonum.org/v1/plot/plotutil"
    "gonum.org/v1/plot/vg"
    "github.com/kniren/gota/dataframe"
)

## Loading and parsing JSON data

This portion of the example will utilize an [emoji data set called EmoSim508](https://www.kaggle.com/sanjayaw/emosim508) to illustrate various JSON gathering, parsing, and manipulation techniques. EmoSim508 is the largest emoji similarity dataset that provides emoji similarity scores for 508 carefully selected emoji pairs. The most frequently co-occurring emoji pairs in a tweet corpus (containing 147 million tweets) were used to create the dataset, and each emoji pair was annotated for its similarity by 10 human annotators. The EmoSim508 dataset also includes the emoji similarity scores generated from 8 different emoji embedding models proposed in the paper "A Semantics-Based Measure of Emoji Similarity" by Wijeratne et al.

We will illustrate parsing the JSON file with stdlib's `encoding/json`, along with some basic manipulations. 

### Data import and parsing with `encoding/json`

In [None]:
// First we need to create structs that define the
// structure of the JSON that we expect.
type emoji struct{
    Unicodelong  string `json:"unicodelong"`
    Unicodeshort string `json:"unicodeshort"`
    Title        string `json:"title"`
}

type emojiTuple struct{
    EmojiOne emoji `json:"emojiOne"`
    EmojiTwo emoji `json:"emojiTwo"`
}

type emojiSimilarityMetrics struct{
    Google_Sense_Label        float32
    Twitter_Sense_Def         float32
    Google_Sense_All          float32
    Google_Sense_Def          float32
    Google_Sense_Desc         float32
    Twitter_Sense_All         float32
    Twitter_Sense_Desc        float32
    Twitter_Sense_Label       float32
    Human_Annotator_Agreement float32 
}

type emojiSim struct{
    EmojiPairId         string                 `json:"emojiPairId"`
    EmojiPair           emojiTuple             `json:"emojiPair"`
    EmojiPairSimilarity emojiSimilarityMetrics `json:"emojiPairSimilarity"`
}

In [None]:
// Load the JSON file.
emojiFile, err := ioutil.ReadFile("../data/EmoSim508.json")
if err != nil {
    fmt.Println(err)
}

// Create a slice of emojiSim values to hold the parsed data.
var emojisims []emojiSim

// Unmarshal the data from the file.
if err = json.Unmarshal(emojiFile, &emojisims); err != nil {
    fmt.Println(err)
}

In [None]:
// Pretty print one of the records to see what
// they look like.
firstData, err := json.MarshalIndent(emojisims[0], "", "  ")
if err != nil{
    fmt.Println(err)
}
fmt.Println(string(firstData))
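
As a side note on how `encoding/json` maps JSON keys onto struct fields, here is a minimal, self-contained sketch. The `pair` and `scores` types, the `parsePairs` helper, and the sample input are hypothetical and made up for illustration; they are not taken from EmoSim508. A field with a `json:"..."` tag is matched against that key, while an untagged exported field is matched against its own name, with a case-insensitive match accepted as a fallback.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pair is a hypothetical, trimmed-down record used only for this sketch.
type pair struct {
	ID     string `json:"emojiPairId"`         // explicit key via a struct tag
	Scores scores `json:"emojiPairSimilarity"` // nested object
}

// scores has no tags, so encoding/json matches JSON keys against the
// field names, preferring an exact match and otherwise accepting a
// case-insensitive one.
type scores struct {
	Google_Sense_All          float32
	Human_Annotator_Agreement float32
}

// parsePairs decodes a JSON array of records into a slice of pair.
func parsePairs(data []byte) ([]pair, error) {
	var pairs []pair
	if err := json.Unmarshal(data, &pairs); err != nil {
		return nil, err
	}
	return pairs, nil
}

func main() {
	// Made-up sample input. Note the lower-case key for the second score.
	sample := []byte(`[
	  {
	    "emojiPairId": "1",
	    "emojiPairSimilarity": {
	      "Google_Sense_All": 0.5,
	      "human_annotator_agreement": 3.75
	    }
	  }
	]`)

	pairs, err := parsePairs(sample)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", pairs[0])
}
```

Adding explicit tags to every field is usually the safer choice, since it documents the expected keys and does not depend on the fallback matching.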

### Data manipulation

Let's select out all of the emoji pairs where the `Human_Annotator_Agreement` is greater than 3.5, meaning that the emojis are very similar.

In [None]:
// Create a slice of emojiTuple to hold the selected pairs.
var resultPairs []emojiTuple

// Loop over the parsed emoji data selecting out the emojis.
for _, val := range emojisims {
    if val.EmojiPairSimilarity.Human_Annotator_Agreement > 3.5{ 
        resultPairs = append(resultPairs,val.EmojiPair)
    }
}

In [None]:
// Let's see how many of the emoji pairs satisfy this requirement.
fmt.Printf("Similar emojis count: %d", len(resultPairs))

### Data output

Now let's save all of the similar emoji pairs to an output data file. We can utilize `encoding/json` for this as well.

In [None]:
// Marshal the data in a pretty printed format.
jsonString, err := json.MarshalIndent(resultPairs, "", "  ")
if err != nil {
    fmt.Println(err)
}

// Write the data out to a file.
if err = ioutil.WriteFile("similar_emojis.json", jsonString, 0644); err != nil {
    fmt.Println(err)
}
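
The select-then-save steps above can also be wrapped in small helpers and checked with a round trip. The following is a sketch under stated assumptions: `scoredPair`, `filterByAgreement`, and `roundTrip` are hypothetical names, and the sample values are made up rather than taken from the data set.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scoredPair is a hypothetical stand-in for the notebook's emojiSim type.
type scoredPair struct {
	Title     string  `json:"title"`
	Agreement float32 `json:"agreement"`
}

// filterByAgreement returns the pairs whose agreement score is strictly
// greater than threshold.
func filterByAgreement(pairs []scoredPair, threshold float32) []scoredPair {
	var out []scoredPair
	for _, p := range pairs {
		if p.Agreement > threshold {
			out = append(out, p)
		}
	}
	return out
}

// roundTrip marshals the pairs to indented JSON and parses the result
// back, which is a cheap way to confirm the output is readable again.
func roundTrip(pairs []scoredPair) ([]scoredPair, error) {
	data, err := json.MarshalIndent(pairs, "", "  ")
	if err != nil {
		return nil, err
	}
	var back []scoredPair
	if err := json.Unmarshal(data, &back); err != nil {
		return nil, err
	}
	return back, nil
}

func main() {
	// Made-up sample values for illustration only.
	pairs := []scoredPair{
		{Title: "pair a", Agreement: 3.75},
		{Title: "pair b", Agreement: 1.25},
		{Title: "pair c", Agreement: 4.0},
	}

	similar := filterByAgreement(pairs, 3.5)
	back, err := roundTrip(similar)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("kept %d of %d pairs, %d after round trip\n", len(similar), len(pairs), len(back))
}
```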

## Loading and parsing CSV data

This portion of the example will utilize a [Game of Thrones data set](https://github.com/chrisalbon/war_of_the_five_kings_dataset) to illustrate various CSV gathering, parsing, and manipulation techniques. The data set represents the battles in the War of the Five Kings from George R.R. Martin's A Song Of Ice And Fire series.

We will first illustrate parsing the CSV file with stdlib's `encoding/csv`, along with some basic manipulations. Then we will utilize a couple of third-party packages to further profile the data and build up some intuition.

### Data import with `encoding/csv`

In [None]:
// Open the CSV file at ../data/5kings_battles_v1.csv.
file, err := os.Open("../data/5kings_battles_v1.csv")
if err != nil {
    fmt.Println(err)
}

In [None]:
// Create a new CSV reader.
reader := csv.NewReader(file)

// Read in all the records via the CSV reader method ReadAll.
records, err := reader.ReadAll()
if err != nil {
    fmt.Println(err)
}

// Close the file.
file.Close()

In [None]:
// Let's get a sense of what the records look like
// by printing a few of them.
for idx, record := range records {
    // Examine the header row.
    if idx == 0 {
        fmt.Println("Header: ", record)
    }else{
        // Print a few of the actual records.
        fmt.Printf("\nname: %s\nyear: %s\nattacker_king: %s\ndefender_king: %s\nattacker_1: %s\n", record[0], record[1], record[3], record[4], record[5])
        if idx > 5 {
            break
        }
    }
}

### Basic data parsing and manipulations with stdlib

Many times when prepping data for ML/AI models, we are interested in only a subset of the features/labels. In addition, you will notice that the data imported via `encoding/csv` is all represented as slices of strings. As such, let's:

1. Create a new slice of structs with only the fields of interest, and
2. Parse certain fields into numerical values.

In [None]:
// Define a struct with the fields we want to keep.
type Battle struct{
    Name string
    Year int
    AttackerWin bool
    AttackerSize int
    DefenderSize int
    Region string
}

In [None]:
// Create a slice of Battle with one element per data row
// (every record except the header).
battles := make([]Battle, len(records)-1)

// Loop over the records.
for idx, record := range records {
    
    // Skip the header row.
    if idx !=0{
        // Create a Battle value.
        battle := Battle{
            Name: record[0],
            Region: record[21],
        }

        // Parse the year.
        year, err := strconv.Atoi(record[1])
        if err != nil {
            fmt.Println(err)
            break
        }
        battle.Year = year

        // Parse the outcome.
        var attackerWin bool
        if record[11] == "win" {
            attackerWin = true
        }
        battle.AttackerWin = attackerWin

        // Parse the attacker size.
        if record[15] != "" {
            attackerSize, err := strconv.Atoi(record[15])
            if err != nil {
                fmt.Println(err)
                break
            }
            battle.AttackerSize = attackerSize
        }
        if record[16] != "" {
            defenderSize, err := strconv.Atoi(record[16])
            if err != nil {
                fmt.Println(err)
                break
            }
            battle.DefenderSize = defenderSize
        }

        // Add the data to our new slice.
        battles[idx-1] = battle
    }
}

In [None]:
// Output a couple of the parsed battles to stdout.
fmt.Println(battles[0])
fmt.Println(battles[1])
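
In the parsing loop above, an empty size field leaves the struct field at its zero value, so a missing size cannot be told apart from a real zero. One possible way to keep that distinction is a small helper that reports whether a value was present. This is only a sketch; `parseOptionalInt` is a hypothetical name, not part of the notebook or of `strconv`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseOptionalInt converts a CSV field to an int. An empty field is
// reported as missing (ok == false) rather than as an error, so the
// caller can distinguish "no value" from a parsed zero.
func parseOptionalInt(field string) (value int, ok bool, err error) {
	field = strings.TrimSpace(field)
	if field == "" {
		return 0, false, nil
	}
	value, err = strconv.Atoi(field)
	if err != nil {
		return 0, false, err
	}
	return value, true, nil
}

func main() {
	// Made-up field values: a number, a missing value, and bad input.
	for _, field := range []string{"15000", "", "many"} {
		value, ok, err := parseOptionalInt(field)
		switch {
		case err != nil:
			fmt.Printf("%q: parse error: %v\n", field, err)
		case !ok:
			fmt.Printf("%q: missing\n", field)
		default:
			fmt.Printf("%q: %d\n", field, value)
		}
	}
}
```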

### Basic profiling with stdlib

To count the number of battles in each year and each region observed in the battle data:

In [None]:
// Create a map to hold the frequencies.
yearFrequencies := make(map[int]int)
regionFrequencies := make(map[string]int)

// Loop over records.
for _, battle := range battles {
    
    // Increment a counter for the relevant year.
    yearFrequencies[battle.Year]++
    
    // Increment a counter for the relevant region.
    regionFrequencies[battle.Region]++
}

In [None]:
// Output the year results to stdout.
fmt.Println("Battles per year:\n-------------------------------\n")
for k, v := range yearFrequencies {
    fmt.Printf("year: %d\ncount: %d\n\n", k, v)
}

fmt.Println("Battles per region:\n-------------------------------\n")
for k, v := range regionFrequencies {
    fmt.Printf("region: %s\ncount: %d\n\n", k, v)
}
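
Note that Go does not guarantee any particular iteration order for maps, so the output above can appear in a different order on each run. If a stable order is wanted, one option is to sort the keys first. The sketch below uses a hypothetical `sortedKeys` helper and made-up counts.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of freq in ascending alphabetical order.
func sortedKeys(freq map[string]int) []string {
	keys := make([]string, 0, len(freq))
	for k := range freq {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Made-up counts for illustration only.
	freq := map[string]int{
		"The North":      2,
		"The Riverlands": 3,
		"The Crownlands": 1,
	}

	// Iterate over the sorted keys instead of the map itself.
	for _, region := range sortedKeys(freq) {
		fmt.Printf("region: %s\ncount: %d\n\n", region, freq[region])
	}
}
```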

### Profiling and visualization with third-party packages

Seeing some of the counts above gives us a little intuition about the text fields in the battle data. However, before moving forward with the data, we should make sure that we have a sense of the centrality and spread of our numerical data. Essentially, this means that we would like to get some intuition about (1) where most of our numerical values lie, and (2) how those values are spread out across their range.

To figure this out, let's use a convenient DataFrame package called `gota`. 

In [None]:
// Open the CSV file.
f, err := os.Open("../data/5kings_battles_v1.csv")
if err != nil {
    fmt.Println(err)
}

// Create a dataframe from the CSV file.
// The types of the columns will be inferred.
battleDF := dataframe.ReadCSV(f)
f.Close()

// Select out the columns that we would like to use.
battleDF = battleDF.Select([]string{"name", "year", "attacker_size", "defender_size", "region"})

// Show the nice structure that gota inferred from our data.
fmt.Println(battleDF)

In [None]:
// Now let's output the statistics that
// will give us some intuition about centrality
// and spread of the numerical columns.
battleDF.Select([]string{"year", "attacker_size", "defender_size"}).Describe()

Note that we can clearly see the statistical measures for the `year` column, but we see a bunch of NaNs in the size columns. This is a result of having missing values in those columns, which is also good to know as we are profiling our data. We will attempt to deal with these missing values in an exercise.
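
To make "centrality" and "spread" concrete, here is a minimal stdlib-only sketch that computes a mean and a sample standard deviation while skipping missing values. It is not how `gota` computes its statistics; `meanStd` is a hypothetical helper, missing values are assumed to be encoded as `NaN`, and the sample numbers are made up.

```go
package main

import (
	"fmt"
	"math"
)

// meanStd returns the mean and the sample standard deviation of the
// non-NaN entries in values, along with the number of entries used.
// With no usable entries both statistics are NaN; with a single entry
// the standard deviation is NaN.
func meanStd(values []float64) (mean, std float64, n int) {
	var sum float64
	for _, v := range values {
		if math.IsNaN(v) {
			continue // skip missing values
		}
		sum += v
		n++
	}
	if n == 0 {
		return math.NaN(), math.NaN(), 0
	}
	mean = sum / float64(n)
	if n < 2 {
		return mean, math.NaN(), n
	}

	// Sum of squared deviations from the mean.
	var ss float64
	for _, v := range values {
		if math.IsNaN(v) {
			continue
		}
		d := v - mean
		ss += d * d
	}
	return mean, math.Sqrt(ss / float64(n-1)), n
}

func main() {
	// Made-up sizes with one missing value.
	sizes := []float64{1000, 2000, math.NaN(), 3000}

	mean, std, n := meanStd(sizes)
	fmt.Printf("used %d of %d values, mean: %.1f, std: %.1f\n", n, len(sizes), mean, std)
}
```

Skipping missing values is only one strategy; whether it is appropriate depends on why the values are missing.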

Now, numbers are great, and they can give us some intuition about our data, but visualizations of the data are also super useful. Let's use `gonum.org/v1/plot` to create some plots showing various aspects of our data.

In [None]:
// Create a new plot value that will allow us to
// plot the battle counts per region. 
p, err := plot.New()
if err != nil {
    fmt.Println(err)
}

// Label the plot and the y axis.
p.Title.Text = "battles per region"
p.Y.Label.Text = "count"

In [None]:
// Create a plotter.Values value that will contain
// the data we want to plot.
counts := make(plotter.Values, len(regionFrequencies))

// Create a slice of strings to contain our region names.
regions := make([]string, len(regionFrequencies))

// Loop over our parsed data (the data we parsed into structs)
// to extract the regions and counts corresponding
// to those regions.
idx := 0
for key, value := range regionFrequencies {

    // Convert the integer to a float, which is required
    // to generate the plot.
    counts[idx] = float64(value)

    // Extract the region and replace spaces with new lines
    // so that long labels wrap under the bars.
    regions[idx] = strings.Replace(key, " ", "\n", -1)
    
    // Advance the index.
    idx++
}
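
Because map iteration order is not fixed, the bars built this way may appear in a different order each time the notebook runs. A possible refinement, sketched below with stdlib only, is to sort the entries before building the two slices. The `barData` helper and `regionCount` type are hypothetical and the counts are made up; the resulting `[]float64` could be converted to `plotter.Values`, which is itself a slice of `float64`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// regionCount pairs a plot label with its count.
type regionCount struct {
	Label string
	Count float64
}

// barData converts a frequency map into parallel label and count
// slices, ordered by descending count and then by label, so the
// resulting chart looks the same on every run.
func barData(freq map[string]int) ([]string, []float64) {
	entries := make([]regionCount, 0, len(freq))
	for region, n := range freq {
		entries = append(entries, regionCount{
			// Replace spaces with new lines so long labels wrap.
			Label: strings.ReplaceAll(region, " ", "\n"),
			Count: float64(n),
		})
	}

	sort.Slice(entries, func(i, j int) bool {
		if entries[i].Count != entries[j].Count {
			return entries[i].Count > entries[j].Count
		}
		return entries[i].Label < entries[j].Label
	})

	labels := make([]string, len(entries))
	counts := make([]float64, len(entries))
	for i, e := range entries {
		labels[i] = e.Label
		counts[i] = e.Count
	}
	return labels, counts
}

func main() {
	// Made-up counts for illustration only.
	labels, counts := barData(map[string]int{"The North": 2, "Dorne": 5, "The Reach": 2})
	for i := range labels {
		fmt.Printf("%q: %v\n", labels[i], counts[i])
	}
}
```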

In [None]:
// Create a new bar chart.
w := vg.Points(20)
bars, err := plotter.NewBarChart(counts, w)
if err != nil {
	fmt.Println(err)
}

// Add the bars to the plot.
p.Add(bars)

// Add the x labels.
p.NominalX(regions...)

// Save the plot.
if err := p.Save(6*vg.Inch, 3*vg.Inch, "barchart.png"); err != nil {
	fmt.Println(err)
}

In [None]:
// Open the plot and display it inline
f, err := os.Open("barchart.png")
if err != nil {
    fmt.Println(err)
}

plotBytes, err := ioutil.ReadAll(f)
if err != nil {
    fmt.Println(err)
}
f.Close()

display.PNG(plotBytes)