# Tablesaw 

[Tablesaw](https://tablesaw.tech/) makes it easy to transform, summarize, and filter data, and to compute descriptive statistics. It also works well with libraries such as Smile, which provides fundamental machine learning algorithms.

This notebook contains basic demos of Tablesaw and visualizes the results with the BeakerX interactive visualization APIs. Tablesaw also provides its own visualization APIs if you want to plot outside of BeakerX. The notebook covers basic table manipulation, k-means clustering, linear regression, and fetching financial data.

In [39]:
%%classpath add mvn
tech.tablesaw tablesaw-beakerx 0.30.3
com.jimmoores quandl-tablesaw 2.0.0
com.github.haifengl smile-core 1.5.2

In [40]:
%import static tech.tablesaw.aggregate.AggregateFunctions.*
%import tech.tablesaw.api.*
%import tech.tablesaw.columns.*
%import smile.clustering.*
%import smile.regression.*

// display Tablesaw tables with BeakerX table display widget
tech.tablesaw.beakerx.TablesawDisplayer.register()

null

In [41]:
tornadoes = Table.read().csv("./data/tornadoes_2014.csv")

In [42]:
//print dataset structure
tornadoes.structure()

In [43]:
//get header names
tornadoes.columnNames()

[Date, Time, State, State No, Scale, Injuries, Fatalities, Start Lat, Start Lon, Length, Width]

In [44]:
//displays the row and column counts
tornadoes.shape()

908 rows X 11 cols

In [45]:
//displays the first n rows
tornadoes.first(10)

In [47]:
//summarize the data in each column
tornadoes.summary()


Table summary for: tornadoes_2014.csv
       Column: Date        
 Measure   |    Value     |
---------------------------
    Count  |         908  |
  Missing  |           0  |
 Earliest  |  2014-01-11  |
   Latest  |  2014-12-29  |
     Column: Time     
 Measure   |  Value  |
----------------------
    Count  |    908  |
  Missing  |      0  |
 Earliest  |  00:01  |
   Latest  |  23:59  |
    Column: State     
 Category  |  Count  |
----------------------
       WY  |     13  |
       PA  |      9  |
       OK  |     17  |
       IA  |     56  |
       SC  |      7  |
       AL  |     55  |
       LA  |     15  |
       NM  |     15  |
       DE  |      1  |
       MD  |      2  |
      ...  |    ...  |
       WV  |      9  |
       MS  |     42  |
       WI  |     22  |
       MO  |     47  |
       IN  |     28  |
       NH  |      2  |
       GA  |     32  |
       IL  |     49  |
       VA  |     12  |
   Column: State No   
 Measure   |  Value  |
----------------------
      

In [48]:
//Mapping operations
def month = tornadoes.dateColumn("Date").month()
tornadoes.addColumns(month);
tornadoes.columnNames()

[Date, Time, State, State No, Scale, Injuries, Fatalities, Start Lat, Start Lon, Length, Width, Date month]

In [49]:
//Sorting by column
tornadoes.sortOn("-Fatalities")

In [50]:
//Descriptive statistics
tornadoes.column("Fatalities").summary()

In [51]:
//Performing totals and sub-totals
def injuriesByScale = tornadoes.summarize("Injuries", median).by("Scale")
injuriesByScale.setName("Median injuries by Tornado Scale")
injuriesByScale
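
To make the group-then-aggregate idea behind `summarize(...).by(...)` concrete, here is a minimal sketch in plain Java. It does not use Tablesaw and is not how Tablesaw implements the operation. The class name, method names, and sample values are made up for illustration and are not rows from `tornadoes_2014.csv`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupedMedian {
    /** Median of a non-empty list; averages the two middle values when the size is even. */
    static double median(List<Double> values) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        return n % 2 == 1
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    /** Groups values[i] under keys[i], then reduces each group to its median. */
    static Map<Integer, Double> medianBy(int[] keys, double[] values) {
        Map<Integer, List<Double>> groups = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            groups.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        Map<Integer, Double> result = new TreeMap<>();
        groups.forEach((k, v) -> result.put(k, median(v)));
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical (scale, injuries) pairs, for illustration only
        int[] scale = {0, 0, 1, 1, 1, 2};
        double[] injuries = {0, 2, 1, 3, 8, 5};
        System.out.println(medianBy(scale, injuries)); // prints {0=1.0, 1=3.0, 2=5.0}
    }
}
```

Each key ends up with one aggregated value, which matches the shape of the `injuriesByScale` table: one row per scale.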

In [52]:
//Cross Tabs
tornadoes.xTabCounts("State", "Scale")
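
A cross tab counts how often each pair of values from two columns occurs together. The sketch below shows that counting step in plain Java, without Tablesaw. `CrossTab`, `counts`, and the sample rows are hypothetical names and data chosen for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

class CrossTab {
    /** Counts occurrences of each (row, column) pair: result.get(row).get(col). */
    static Map<String, Map<Integer, Integer>> counts(String[] rows, int[] cols) {
        Map<String, Map<Integer, Integer>> table = new TreeMap<>();
        for (int i = 0; i < rows.length; i++) {
            table.computeIfAbsent(rows[i], k -> new TreeMap<>())
                 .merge(cols[i], 1, Integer::sum); // add 1 to the cell for this pair
        }
        return table;
    }

    public static void main(String[] args) {
        // Hypothetical (state, scale) pairs, for illustration only
        String[] state = {"OK", "OK", "IA", "OK", "IA"};
        int[] scale = {0, 1, 0, 0, 0};
        System.out.println(counts(state, scale)); // prints {IA={0=2}, OK={0=2, 1=1}}
    }
}
```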

## K-means clustering

K-means is the most common form of “centroid” clustering. Unlike classification, clustering is an unsupervised learning method. The categories are not predetermined. Instead, the goal is to search for natural groupings in the dataset, such that the members of each group are similar to each other and different from the members of the other groups. The K represents the number of groups to find.
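
The search for groupings can be sketched as the classic two-step loop: assign each point to its nearest centroid, then move each centroid to the mean of its members. The plain Java below is a minimal illustration of that loop, not Smile's `KMeans` implementation. Seeding with the first `k` rows is an assumption made to keep the example deterministic, and the class and method names are hypothetical.

```java
import java.util.Arrays;

class KMeansSketch {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return s;
    }

    /** Returns a cluster label (0..k-1) for each row of data. */
    static int[] cluster(double[][] data, int k, int maxIter) {
        int n = data.length, d = data[0].length;
        double[][] centroids = new double[k][];
        // Assumption: seed with the first k rows; real libraries use smarter seeding
        for (int c = 0; c < k; c++) centroids[c] = data[c].clone();
        int[] labels = new int[n];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: attach each point to its nearest centroid
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (dist2(data[i], centroids[c]) < dist2(data[i], centroids[best])) best = c;
                }
                if (labels[i] != best) { labels[i] = best; changed = true; }
            }
            if (!changed && iter > 0) break; // labels are stable, so we are done
            // Update step: move each centroid to the mean of its members
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[labels[i]]++;
                for (int j = 0; j < d; j++) sums[labels[i]][j] += data[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // leave an empty cluster's centroid in place
                for (int j = 0; j < d; j++) centroids[c][j] = sums[c][j] / counts[c];
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Two obvious groups: points near (0, 0) and points near (10, 10)
        double[][] points = {{0, 0}, {10, 10}, {0, 1}, {1, 0}, {10, 11}, {11, 10}};
        System.out.println(Arrays.toString(cluster(points, 2, 100))); // prints [0, 1, 0, 0, 1, 1]
    }
}
```

The result depends on the starting centroids, which is why production implementations typically use randomized or repeated seeding.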

We’ll use a well-known Scotch whisky dataset, which is used to cluster whiskies by taste, based on data collected from tasting notes. As always, we start by loading the data and printing its structure.

In [53]:
whiskeyData = Table.read().csv("./data/whiskey.csv")
whiskeyData.structure()

In [54]:
kMeans = new KMeans(whiskeyData.as().doubleMatrix("Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"), 5)

K-Means distortion: 387.52701
Clusters of 86 data points of dimension 12:
  0	   24 (27.9%)
  1	   16 (18.6%)
  2	   23 (26.7%)
  3	    6 ( 7.0%)
  4	   17 (19.8%)


In [55]:
// pair each distillery with its cluster label, then sort by cluster and name
Table whiskeyClusters = Table.create("Clusters", whiskeyData.stringColumn("Distillery"), DoubleColumn.create("Cluster", kMeans.getClusterLabel()));
whiskeyClusters = whiskeyClusters.sortAscendingOn("Cluster", "Distillery");

## Play (Money)ball with Linear Regression

In baseball, you make the playoffs by winning more games than your rivals. The number of games your rivals win is out of your control, so the Oakland A’s looked instead at how many wins it had historically taken to make the playoffs. They decided that 95 wins would give them a strong chance. Here’s how we might check that assumption in Tablesaw.

In [65]:
baseball = Table.read().csv("./data/baseball.csv");

// filter to the data available at the start of the 2002 season
moneyball = baseball.where(baseball.numberColumn("Year").isLessThan(2002));
wins = moneyball.nCol("W");
year = moneyball.nCol("Year");
playoffs = moneyball.column("Playoffs");
runDifference = moneyball.numberColumn("RS").subtract(moneyball.numberColumn("RA")).setName("RD");
moneyball.addColumns(runDifference);

// scatter plot of wins against run difference
def plot = new Plot(title: "RD x Wins", xLabel:"RD", yLabel: "W")
plot << new Points(x: moneyball.numberColumn("RD").asDoubleArray(), y: moneyball.numberColumn("W").asDoubleArray())

In [66]:
// numericDataset("RD") marks RD as the response column, leaving W as the predictor
winsModel = new OLS(moneyball.select("W", "RD").smile().numericDataset("RD"));

Linear Model:

Residuals:
	       Min	        1Q	    Median	        3Q	       Max
	 -115.1010	  -24.7084	   -0.8748	   23.9474	  110.2269

Coefficients:
            Estimate        Std. Error        t value        Pr(>|t|)
Intercept  -673.5757            8.3409       -80.7558          0.0000 ***
W	      8.3279            0.1021        81.5536          0.0000 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 34.9536 on 900 degrees of freedom
Multiple R-squared: 0.8808,    Adjusted R-squared: 0.8807
F-statistic: 6650.9926 on 1 and 900 DF,  p-value: 0.000
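
To read the fitted line, plug a win total into `RD = Intercept + slope * W`. The sketch below does this for the 95-win target mentioned earlier, using the two coefficients copied from the summary above. The class and method names are hypothetical, and the result is only as good as the fit itself.

```java
class WinsModel {
    // Coefficients copied from the model summary above (response RD, predictor W)
    static final double INTERCEPT = -673.5757;
    static final double SLOPE_W = 8.3279;

    /** Fitted run difference for a season with the given number of wins. */
    static double fittedRunDifference(double wins) {
        return INTERCEPT + SLOPE_W * wins;
    }

    public static void main(String[] args) {
        // -673.5757 + 8.3279 * 95 is roughly 117.6 runs
        System.out.println(fittedRunDifference(95));
    }
}
```

Under this fit, a 95-win season corresponds to a run difference of roughly 118.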


In [67]:
// RS is the response column; OBP and SLG are the predictors
runsScored = new OLS(moneyball.select("OBP", "SLG", "RS").smile().numericDataset("RS"));

Linear Model:

Residuals:
	       Min	        1Q	    Median	        3Q	       Max
	  -70.8379	  -17.1810	   -1.0917	   16.7812	   90.0358

Coefficients:
            Estimate        Std. Error        t value        Pr(>|t|)
Intercept  -804.6271           18.9208       -42.5261          0.0000 ***
OBP	   2737.7682           90.6846        30.1900          0.0000 ***
SLG	   1584.9085           42.1556        37.5966          0.0000 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.7900 on 899 degrees of freedom
Multiple R-squared: 0.9296,    Adjusted R-squared: 0.9294
F-statistic: 5933.7256 on 2 and 899 DF,  p-value: 0.000
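
With two predictors, the fitted value is `RS = Intercept + b1 * OBP + b2 * SLG`, and a residual is the observed value minus the fitted one. The sketch below uses the coefficients from the summary above. The team statistics in `main` are invented for illustration, and the class and method names are hypothetical.

```java
class RunsScoredModel {
    // Coefficients copied from the model summary above (response RS)
    static final double INTERCEPT = -804.6271;
    static final double COEF_OBP = 2737.7682;
    static final double COEF_SLG = 1584.9085;

    /** Fitted runs scored for a team with the given OBP and SLG. */
    static double fittedRuns(double obp, double slg) {
        return INTERCEPT + COEF_OBP * obp + COEF_SLG * slg;
    }

    /** Residual: observed runs minus fitted runs. */
    static double residual(double actualRuns, double obp, double slg) {
        return actualRuns - fittedRuns(obp, slg);
    }

    public static void main(String[] args) {
        // Hypothetical team: OBP 0.340, SLG 0.420, 800 runs actually scored
        System.out.println(fittedRuns(0.340, 0.420));      // roughly 791.9
        System.out.println(residual(800, 0.340, 0.420));   // roughly 8.1
    }
}
```

A positive residual means the team scored more runs than the model expected from its OBP and SLG.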


In [68]:
// distribution of the runs-scored model's residuals
new Histogram(xLabel:"Residual",
              yLabel:"Proportion",
              data: Arrays.asList(runsScored.residuals()),
              binCount: 25);

## Financial and Economic Data

You can fetch data from [Quandl](https://www.quandl.com/) and load it directly into Tablesaw.

In [69]:
%import com.jimmoores.quandl.*
%import com.jimmoores.quandl.tablesaw.*

In [70]:
TableSawQuandlSession session = TableSawQuandlSession.create();
Table table = session.getDataSet(DataSetRequest.Builder.of("WIKI/AAPL").build());
// Create a new column containing the year
NumberColumn yearColumn = table.dateColumn("Date").year();
yearColumn.setName("Year");
table.addColumns(yearColumn);
// Create max, min and total volume tables aggregated by year
Table summaryMax = table.summarize("Adj. Close", max).by("Year");
Table summaryMin = table.summarize("Adj. Close", min).by("Year");
Table summaryVolume = table.summarize("Volume", sum).by("Year");
// Combine the year, max, min and volume columns into one summary table
summary = Table.create("Summary", summaryMax.column(0), summaryMax.column(1), 
                       summaryMin.column(1), summaryVolume.column(1));
// Add back a DateColumn to the summary...will be used for plotting
DateColumn yearDates = DateColumn.create("YearDate");
for (year in summary.column('Year')) {
    yearDates.append(java.time.LocalDate.of((int) year, 1, 1));
}
summary.addColumns(yearDates)

summary

In [71]:
years = summary.column('YearDate').collect()

plot = new TimePlot(title: 'Price Chart for AAPL', xLabel: 'Time', yLabel: 'Max [Adj. Close]')
plot << new YAxis(label: 'Volume')
plot << new Points(x: years, y: summary.column('Max [Adj. Close]').collect())
plot << new Line(x: years, y: summary.column('Max [Adj. Close]').collect(), color: Color.blue)
plot << new Stems(x: years, y: summary.column('Sum [Volume]').collect(), yAxis: 'Volume')