## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) is the Databricks File System, which lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.
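
DBFS paths are written in more than one form: Spark readers and `%fs` accept `dbfs:/...` (or a bare `/...`), while local file APIs such as Python's `open()` typically see the same files under `/dbfs/...`. The sketch below converts between the two forms. The helper names `to_spark_path` and `to_local_path` are hypothetical, not part of any Databricks API, and the `/dbfs` mount is assumed to be available on the cluster.

```python
def to_spark_path(path: str) -> str:
    """Return the dbfs:/ form of a DBFS path (hypothetical helper)."""
    if path.startswith("dbfs:/"):
        return path
    if path.startswith("/dbfs/"):
        # Local-mount form: drop the /dbfs/ prefix.
        return "dbfs:/" + path[len("/dbfs/"):]
    # Bare form such as /FileStore/tables/Train_X.csv.
    return "dbfs:/" + path.lstrip("/")


def to_local_path(path: str) -> str:
    """Return the /dbfs/ form used by local file APIs (hypothetical helper)."""
    return "/dbfs/" + to_spark_path(path)[len("dbfs:/"):]


print(to_spark_path("/FileStore/tables/Train_X.csv"))
print(to_local_path("/FileStore/tables/Train_X.csv"))
```

The `file_location` used in the next cell is in the bare form, which Spark resolves against DBFS.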

This notebook is written in **Python**, so the default cell type is Python. You can switch languages within a cell by starting it with the `%LANGUAGE` magic command (for example, `%sql`). Python, Scala, SQL, and R are all supported.

In [2]:
# File location and type
#file_location = "/FileStore/tables/health_factors.csv"
file_location = "/FileStore/tables/Train_X.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

_c0,Active.Physicians.per.100.000.Population..2018..AAMC.,Total.Active.Patient.Care.Physicians.per.100.000.Population..2018..AAMC.,Active.Primary.Care.Physicians.per.100.000.Population..2018..AAMC.,Active.Patient.Care.Primary.Care.Physicians.per.100.000.Population..2018..AAMC.,Active.General.Surgeons.per.100.000.Population..2018..AAMC.,Active.Patient.Care.General.Surgeons.per.100.000.Population..2018..AAMC.,Percentage.of.Active.Physicians.Who.Are.Female..2018..AAMC.,Percentage.of.Active.Physicians.Who.Are.International.Medical.Graduates..IMGs...2018..AAMC.,Percentage.of.Active.Physicians.Who.Are.Age.60.or.Older..2018..AAMC.,MD.and.DO.Student.Enrollment.per.100.000.Population..AY.2018.2019..AAMC.,Student.Enrollment.at.Public.MD.and.DO.Schools.per.100.000.Population..AY.2018.2019..AAMC.,Percentage.Change.in.Student.Enrollment.at.MD.and.DO.Schools..2008.2018..AAMC.,Percentage.of.MD.Students.Matriculating.In.State..AY.2018.2019..AAMC.,Total.Residents.Fellows.in.ACGME.Programs.per.100.000.Population.as.of.December.31..2018..AAMC.,Total.Residents.Fellows.in.Primary.Care.ACGME.Programs.per.100.000.Population.as.of.Dec..31..2018..AAMC.,Percentage.of.Residents.in.ACGME.Programs.Who.Are.IMGs.as.of.December.31..2018..AAMC.,Ratio.of.Residents.and.Fellows..GME..to.Medical.Students..UME...AY.2017.2018..AAMC.,Percent.Change.in.Residents.and.Fellows.in.ACGME.Accredited.Programs..2008.2018..AAMC.,Percentage.of.Physicians.Retained.in.State.from.Undergraduate.Medical.Education..UME...2018..AAMC.,All.Specialties..AAMC.,State.Local.Government.hospital.beds.per.1000.people..2019.,Non.profit.hospital.beds.per.1000.people..2019.,For.profit.hospital.beds.per.1000.people..2019.,Total.hospital.beds.per.1000.people..2019.,Total.nurse.practitioners..2019.,Total.physician.assistants..2019.,Total.Hospitals..2019.,Total.Primary.Care.Physicians..2019.,Surgery.specialists..2019.,Emergency.Medicine.specialists..2019.,Total.Specialist.Physicians..2019.,ICU.Beds,Length.of.Life.rank,Quality.of.Life.rank,Health.Behaviors.rank,Clinical.Care.rank,Social...Economic.Factors.rank,Physical.Environment.rank,Adult.smoking.percentage,Adult.obesity.percentage,Excessive.drinking.percentage,Population.per.sq.mile,House.per.sq.mile,Share.of.Tests.with.Positive.COVID.19.Results,Number.of.Tests.with.Results.per.1.000.Population
657,284.4,240.5,99.6,87.2,7.0,5.8,38.7,30.7,30.4,44.7,13.4,2.5,62.6,49.1,19.8,25.9,1.2,7.1,30.9,2.8443428657820573,0.1,2.2,0.2,2.5,0.3895274183889149,0.195587815171442,0.014676934769845,1.72614880342884,0.161132337306247,0.2076747026186002,1.678350657585721,0.0,31.0,37.0,14.0,56.0,26.0,21.0,15.0,26.0,20.0,32.6,14.2,0.1730789999999999,38.6
1331,302.7,265.0,104.9,96.0,7.4,6.6,37.2,17.2,27.9,24.2,18.1,29.7,54.3,41.9,13.4,17.5,1.7,7.0,52.0,3.026814863838211,0.3,2.2,0.4802056555269882,2.5,0.5132611167941791,0.3229267859932588,0.0226333895457706,1.566159268631962,0.1769681559039109,0.1673445099245626,1.6721975897490773,0.0,15.0,43.0,39.0,46.0,58.0,65.0,15.0,33.0,20.0,45.2,18.7,0.105847,21.6
597,284.4,240.5,99.6,87.2,7.0,5.8,38.7,30.7,30.4,44.7,13.4,2.5,62.6,49.1,19.8,25.9,1.2,7.1,30.9,2.844342866671898,0.1,2.2,0.2,2.5,0.3895274183865651,0.1955878151534175,0.0146769347681079,1.72614880365691,0.1611323372832143,0.2076747025818096,1.678350657812132,0.1569489131287766,20.0,11.0,36.0,36.0,14.0,96.0,15.0,35.0,22.0,119.8,47.8,0.1730789999999999,38.6
1753,375.1,304.1,112.9,94.8,10.2,7.4,38.9,37.0,34.8,58.2,15.6,33.4,67.0,88.8,33.1,37.6,1.6,9.0,35.9,3.7507530497692407,0.5,2.2,0.4802056555269882,2.7,0.6453211098914806,0.4550662619433703,0.0084944337782212,2.0771449123113386,0.2196783383060995,0.2333410721466883,2.5026853411500567,0.1247349382562055,25.0,9.0,40.0,41.0,12.0,36.0,16.0,38.0,21.0,71.1,30.3,0.270601,64.7
1205,287.0,249.7,97.8,87.6,7.9,6.7,35.0,29.2,31.6,52.9,49.9,56.1,71.5,69.0,23.6,26.8,1.4,49.5,43.4,2.8703725478897373,0.1,2.2,0.3,2.5,0.4421806307428749,0.3784545986606448,0.0144058847998754,1.8523566874318644,0.2099857791621242,0.334036453823392,2.031129716555053,0.0311477962934122,14.0,25.0,58.0,58.0,18.0,74.0,18.0,35.0,22.0,111.9,43.4,0.1569809999999999,30.9
2888,263.4,231.6,100.1,89.4,10.5,9.2,29.7,27.5,33.5,89.2,89.2,13.8,88.9,51.8,23.0,30.2,0.6,44.1,30.4,2.6342428307987067,0.4,2.8,0.7,3.8,0.5216432094359956,0.36991259441983,0.0310106366423182,1.5959402646389655,0.1799724448329541,0.203784183690576,1.5521931165129923,0.8861214225841216,25.0,9.0,24.0,1.0,6.0,47.0,22.0,35.0,13.0,420.0,200.1,0.020966,37.2
2927,264.9,238.3,94.6,86.4,7.9,7.1,35.0,19.3,29.2,30.4,13.1,14.8,70.4,34.6,12.0,16.3,1.1,17.9,36.5,2.64880362584807,0.0,2.0,0.1,2.1,0.523086682793905,0.3165009853186519,0.0228775168501835,1.4617529195862529,0.155670321543766,0.1529181391391391,1.5819888922255587,0.0,71.0,66.0,69.0,67.0,67.0,1.0,19.0,29.0,24.0,9.2,8.8,0.085163,22.0
570,284.4,240.5,99.6,87.2,7.0,5.8,38.7,30.7,30.4,44.7,13.4,2.5,62.6,49.1,19.8,25.9,1.2,7.1,30.9,2.844342865408664,0.1,2.2,0.2,2.5,0.3895274184818112,0.1955878151072288,0.0146769347667842,1.7261488037777406,0.1611323372902551,0.2076747025775985,1.678350657931575,0.0,16.0,30.0,33.0,77.0,54.0,61.0,15.0,33.0,20.0,193.0,71.1,0.1730789999999999,38.6
632,284.4,240.5,99.6,87.2,7.0,5.8,38.7,30.7,30.4,44.7,13.4,2.5,62.6,49.1,19.8,25.9,1.2,7.1,30.9,2.844342865880297,0.1,2.2,0.2,2.5,0.3895274183777487,0.1955878151740206,0.0146769347604485,1.7261488038444734,0.1611323372943061,0.2076747026066695,1.6783506579292269,0.0,1.0,4.0,8.0,7.0,1.0,60.0,13.0,36.0,22.0,85.6,34.8,0.1730789999999999,38.6
2550,224.8,199.9,72.9,66.5,6.2,5.4,35.2,25.9,27.9,28.3,24.3,55.2,84.9,29.9,10.1,20.8,1.1,23.1,59.7,2.248392045748384,0.3,1.0,1.0,2.3,0.3449255614122327,0.2035757631526603,0.0182218251118846,1.0697570139234212,0.1211420380905022,0.1135467075584286,1.181039058677275,0.1491795126802585,24.0,44.0,59.0,61.0,57.0,173.0,13.0,36.0,18.0,19.9,10.7,0.072184,20.3
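
Many column names in this file contain dots (for example `ICU.Beds`). Spark reads a dot in a column reference as struct-field access, so such names generally need to be wrapped in backticks before they are passed to `select` or used in SQL. The following is a minimal sketch of two common workarounds. `quote_column` is a hypothetical helper name, and the two column names are taken from the header of the loaded file.

```python
def quote_column(name: str) -> str:
    """Wrap a column name in backticks so Spark treats its dots literally.

    A backtick inside the name is escaped by doubling it.
    """
    return "`" + name.replace("`", "``") + "`"


# Two column names from the loaded file's header.
cols = ["ICU.Beds", "Length.of.Life.rank"]

# Option 1: quote the names each time they are referenced.
quoted = [quote_column(c) for c in cols]
print(quoted)

# Option 2: compute dot-free names once and rename the columns.
sanitized = [c.replace(".", "_") for c in cols]
print(sanitized)
```

In the notebook, option 1 would be used as `df.select(*quoted).show(5)`. Option 2 would be applied to every column with `df.toDF(*[c.replace(".", "_") for c in df.columns])`.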


In [3]:
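from pyspark import SparkContext

# Databricks notebooks already provide a SparkContext as `sc`,
# so creating a new one here is unnecessary.
#sc = SparkContext()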

In [4]:
df.printSchema()


In [5]:
df.head(5)

In [6]:
df.show(2, truncate=True)

In [7]:
df.count()

In [8]:
len(df.columns), df.columns

In [9]:
df.describe().show()
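
As a rough illustration of what `describe()` reports for a numeric column (count, mean, stddev, min, max), the pure-Python sketch below computes the same five statistics for a short list. It assumes that nulls are skipped and that `stddev` is the sample standard deviation. The function name `describe_values` is hypothetical. The numeric values are the first five `Length.of.Life.rank` entries from the output above, and the `None` is added only to show null handling.

```python
import statistics


def describe_values(values):
    """Compute describe()-style statistics for one numeric column."""
    present = [v for v in values if v is not None]  # skip nulls
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "stddev": statistics.stdev(present),  # sample standard deviation
        "min": min(present),
        "max": max(present),
    }


# First five Length.of.Life.rank values, plus an illustrative null.
sample = [31.0, 15.0, None, 20.0, 25.0, 14.0]
stats = describe_values(sample)
for name, value in stats.items():
    print(name, value)
```

Spark returns these statistics as strings in a DataFrame, whereas this sketch keeps them as Python numbers.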

In [10]:
# Requires a DataFrame with a 'cases' column; the Train_X.csv header shown above has none.
df.describe('cases').show()

In [11]:
df.select('fips','cases').show(5)

In [12]:
df.select('fips','cases').distinct().count()

In [13]:
#df.crosstab('state', 'cases').show()

In [14]:
#df.crosstab('state', 'cases').dropDuplicates().show()

In [15]:
df.dropna().count()

In [16]:
df.groupby('state').agg({'cases': 'mean'}).show()

In [17]:
df.groupby('cases').count().show()

In [18]:
# DataFrames have no .map() in Spark 2.0+; go through the underlying RDD instead.
#df.select('date').rdd.map(lambda x: (x, 1)).take(5)

In [19]:
df.orderBy(df.cases.desc()).show(5)

In [20]:
%fs ls

path,name,size
dbfs:/FileStore/,FileStore/,0
dbfs:/databricks-datasets/,databricks-datasets/,0
dbfs:/databricks-results/,databricks-results/,0
dbfs:/tmp/,tmp/,0


In [21]:
%fs ls dbfs:/databricks-datasets

path,name,size
dbfs:/databricks-datasets/,databricks-datasets/,0
dbfs:/databricks-datasets/COVID/,COVID/,0
dbfs:/databricks-datasets/README.md,README.md,976
dbfs:/databricks-datasets/Rdatasets/,Rdatasets/,0
dbfs:/databricks-datasets/SPARK_README.md,SPARK_README.md,3359
dbfs:/databricks-datasets/adult/,adult/,0
dbfs:/databricks-datasets/airlines/,airlines/,0
dbfs:/databricks-datasets/amazon/,amazon/,0
dbfs:/databricks-datasets/asa/,asa/,0
dbfs:/databricks-datasets/atlas_higgs/,atlas_higgs/,0
