`Smile` provides functionality equivalent to pandas for Kotlin. See the [documentation](https://github.com/haifengl/smile).

In [1]:
@file:Repository("https://repo1.maven.org/maven2/")
@file:DependsOn("com.github.haifengl:smile-core:2.4.0")
import smile.*
import smile.data.*
import smile.io.*
import org.apache.commons.csv.CSVFormat

# Loading data files


Let's import our datafile mpg.csv, which contains fuel economy data for 234 cars.

* mpg : miles per gallon
* class : car classification
* cty : city mpg
* cyl : # of cylinders
* displ : engine displacement in liters
* drv : f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
* fl : fuel (e = ethanol E85, d = diesel, r = regular, p = premium, c = CNG)
* hwy : highway mpg
* manufacturer : automobile manufacturer
* model : model of car
* trans : type of transmission
* year : model year



In [2]:
// Read the file
val df = Read.csv("../data/mpg.csv", CSVFormat.DEFAULT.withFirstRecordAsHeader())

In [3]:
df.summary()

[column: String, count: long, min: double, avg: double, max: double]
+------+-----+----+---------+----+
|column|count| min|      avg| max|
+------+-----+----+---------+----+
|    id|  234|   1|    117.5| 234|
| displ|  234| 1.6| 3.471795|   7|
|  year|  234|1999|   2003.5|2008|
|   cyl|  234|   4| 5.888889|   8|
|   cty|  234|   9|16.858974|  35|
|   hwy|  234|  12|23.440171|  44|
+------+-----+----+---------+----+


In [4]:
// View the number of rows in the DataFrame
df.size()

234

In [5]:
// View the column names and types
df.schema()

[id: int, manufacturer: String, model: String, displ: double, year: int, cyl: int, trans: String, drv: String, cty: int, hwy: int, fl: String, class: String]

In [6]:
// This is how to find the average cty (city) fuel economy across all cars.
df.column("cty").toIntArray().average()

16.858974358974358

In [7]:
// Similarly, this is how to find the average hwy (highway) fuel economy across all cars.
df.column("hwy").toIntArray().average()

23.44017094017094

In [8]:
// Return the unique values for the number of cylinders of the cars in our dataset.
df.column("cyl").toIntArray().distinct()

[4, 6, 8, 5]
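
The three cells above follow one pattern: pull a column out of the DataFrame as a primitive array, then hand it to the Kotlin standard library. `average()` and `distinct()` are Kotlin extension functions on `IntArray`, not Smile methods. A minimal sketch of that second half, using a plain `IntArray` in place of `df.column(...).toIntArray()`. The values are the `cty` and `cyl` entries of the five midsize rows (ids 16, 17, 18, 33, 34) printed elsewhere in this notebook, and the function names are illustrative only:

```kotlin
// Stand-ins for df.column("cty").toIntArray() and df.column("cyl").toIntArray(),
// restricted to the five midsize rows (ids 16, 17, 18, 33, 34) shown in this notebook.
val sampleCty: IntArray = intArrayOf(15, 17, 16, 19, 22)
val sampleCyl: IntArray = intArrayOf(6, 6, 8, 4, 4)

// Mean of an Int column; IntArray.average() returns a Double.
fun columnMean(values: IntArray): Double = values.average()

// Unique values in first-seen order; sorting is a separate step.
fun columnDistinct(values: IntArray): List<Int> = values.distinct()

fun main() {
    println(columnMean(sampleCty))               // 17.8
    println(columnDistinct(sampleCyl))           // [6, 8, 4]
    println(columnDistinct(sampleCyl).sorted())  // [4, 6, 8]
}
```

`distinct()` keeps values in the order they first appear, which is why the cell above reports `[4, 6, 8, 5]` rather than a sorted list.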

In [9]:
// Select only a few columns
df.select("manufacturer", "model")

[manufacturer: String, model: String]
+------------+----------+
|manufacturer|     model|
+------------+----------+
|        audi|        a4|
|        audi|        a4|
|        audi|        a4|
|        audi|        a4|
|        audi|        a4|
|        audi|        a4|
|        audi|        a4|
|        audi|a4 quattro|
|        audi|a4 quattro|
|        audi|a4 quattro|
+------------+----------+
224 more rows...


In [10]:
// Filtering data:
// find the first row on the stream that has 4 cylinders

df.stream().filter({row -> row.getInt("cyl") == 4}).findFirst()

Optional[{
  id: 1,
  manufacturer: audi,
  model: a4,
  displ: 1.8,
  year: 1999,
  cyl: 4,
  trans: auto(l5),
  drv: f,
  cty: 18,
  hwy: 29,
  fl: p,
  class: compact
}]

In [11]:
// Build a new DataFrame from the rows that match both conditions
DataFrame.of(df.stream().filter({row -> row.getInt("cyl") == 4 && row.getString("manufacturer") == "audi"}))

[id: int, manufacturer: String, model: String, displ: double, year: int, cyl: int, trans: String, drv: String, cty: int, hwy: int, fl: String, class: String]
+---+------------+----------+-----+----+---+----------+---+---+---+---+-------+
| id|manufacturer|     model|displ|year|cyl|     trans|drv|cty|hwy| fl|  class|
+---+------------+----------+-----+----+---+----------+---+---+---+---+-------+
|  1|        audi|        a4|  1.8|1999|  4|  auto(l5)|  f| 18| 29|  p|compact|
|  2|        audi|        a4|  1.8|1999|  4|manual(m5)|  f| 21| 29|  p|compact|
|  3|        audi|        a4|    2|2008|  4|manual(m6)|  f| 20| 31|  p|compact|
|  4|        audi|        a4|    2|2008|  4|  auto(av)|  f| 21| 30|  p|compact|
|  8|        audi|a4 quattro|  1.8|1999|  4|manual(m5)|  4| 18| 26|  p|compact|
|  9|        audi|a4 quattro|  1.8|1999|  4|  auto(l5)|  4| 16| 25|  p|compact|
| 10|        audi|a4 quattro|    2|2008|  4|manual(m6)|  4| 20| 28|  p|compact|
| 11|        audi|a4 quattro|    2|2008|  
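
The predicate in the cell above compares strings with `==`. In Kotlin this is structural equality (it calls `equals`), so the manufacturer check behaves as intended. A minimal sketch of the same two-condition filter on plain Kotlin objects; `CarRow` and `fourCylinderAudis` are hypothetical names standing in for a Smile row and the stream filter, and the four rows are copied from outputs shown in this notebook:

```kotlin
// Hypothetical stand-in for a Smile Tuple, holding only the fields the filter needs.
data class CarRow(val id: Int, val manufacturer: String, val model: String, val cyl: Int)

// Rows copied from this notebook's outputs (ids 1, 8, 16, 33).
val sampleRows = listOf(
    CarRow(1, "audi", "a4", 4),
    CarRow(8, "audi", "a4 quattro", 4),
    CarRow(16, "audi", "a6 quattro", 6),
    CarRow(33, "chevrolet", "malibu", 4)
)

// Keep rows that satisfy both conditions, mirroring the stream filter above.
fun fourCylinderAudis(rows: List<CarRow>): List<CarRow> =
    rows.filter { it.cyl == 4 && it.manufacturer == "audi" }

fun main() {
    println(fourCylinderAudis(sampleRows).map { it.id })  // [1, 8]
}
```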

In [12]:
// Return the unique values of the class column in our dataset.
df.column("class").toStringArray().distinct()

[compact, midsize, suv, 2seater, minivan, pickup, subcompact]

In [13]:
// And here's an example of how to group entries by feature

df.stream().collect(java.util.stream.Collectors.groupingBy({row: Tuple -> row.getString("class")}))

{midsize=[{
  id: 16,
  manufacturer: audi,
  model: a6 quattro,
  displ: 2.8,
  year: 1999,
  cyl: 6,
  trans: auto(l5),
  drv: 4,
  cty: 15,
  hwy: 24,
  fl: p,
  class: midsize
}, {
  id: 17,
  manufacturer: audi,
  model: a6 quattro,
  displ: 3.1,
  year: 2008,
  cyl: 6,
  trans: auto(s6),
  drv: 4,
  cty: 17,
  hwy: 25,
  fl: p,
  class: midsize
}, {
  id: 18,
  manufacturer: audi,
  model: a6 quattro,
  displ: 4.2,
  year: 2008,
  cyl: 8,
  trans: auto(s6),
  drv: 4,
  cty: 16,
  hwy: 23,
  fl: p,
  class: midsize
}, {
  id: 33,
  manufacturer: chevrolet,
  model: malibu,
  displ: 2.4,
  year: 1999,
  cyl: 4,
  trans: auto(l4),
  drv: f,
  cty: 19,
  hwy: 27,
  fl: r,
  class: midsize
}, {
  id: 34,
  manufacturer: chevrolet,
  model: malibu,
  displ: 2.4,
  year: 2008,
  cyl: 4,
  trans: auto(l4),
  drv: f,
  cty: 22,
  hwy: 30,
  fl: r,
  class: midsize
}, {
  id: 35,
  manufacturer: chevrolet,
  model: malibu,
  displ: 3.1,
  year: 1999,
  cyl: 6,
  trans: auto(l4),
  drv: f,


The original notebook also explains the basics of classes, objects, and datetime functions, which are straightforward in Kotlin and out of scope for this notebook.