# DSC 340 Lab 1: Getting Started

## Objectives

This lab provides a review of skills you developed in DSC 270 and CSC 170 including:

* The Jupyter Notebook environment. 
* Essential Python programming skills:
  * functions, 
  * nested loops, 
  * math operations, and 
  * importing packages. 
* NumPy and Pandas.
  * While it is not too difficult to complete this assignment in plain Python, most students find it much easier to complete if they take full advantage of the Pandas and NumPy packages.
  * Note, however, that you *must* write a Python function to successfully complete this project.
* Importing and working with a simple data set to answer questions.

The lab also asks you to **communicate** your work in writing. Keep in mind that this Jupyter Notebook is not only a work environment in which to develop and run your code, but a presentation of your work. 

## _Overview_

_To become an effective data scientist, you need extensive practice. Labs&mdash;weekly analytics assignments done outside of class&mdash;are an opportunity to strengthen your understanding of applied machine learning and other analytic methods.  While the value of each lab as a percent of your final grade is relatively small, the wisdom and experience you will gain from doing each lab will significantly help you on your tests and further career in data science and analytics._

_The first lab is an introduction to the Jupyter Notebook environment and a review of fundamental programming skills in Python3. In most future labs, you will write code and analyze data in collaboration with a lab partner&mdash;and then independently write your report. **The first lab, however, is meant for you to do independently**&mdash;perhaps with some help from your instructor. We will be using the Jupyter environment for all of our labs and for the final project, so it is important that you are comfortable working in this environment._

## _General instructions_

_Your work will consist of code (in Code cells) and corresponding output, as well as formatted text (in Markdown cells).  In your lab reports, you will alternate between text exposition and code, sometimes with output in text or graphics.  You will describe what you are doing, provide the code to do it, and make observations.  A major advantage of the Jupyter Notebook type environment is that it allows you to submit your code, analytic work, and discussion as a single document that tells a "story" describing your work._

* _First, rename your file:_  

  * _Replace the word `instructions` with your last name, an underscore, and your first name._ 
  * _The `.ipynb` extension specifies that the file is an IPython (Jupyter) notebook file._ 
  * _Filename should be all lowercase.  So, if your name is Yennefer of Vengerberg, your file name would be_ `dsc305_lab01_vengerberg_yennefer.ipynb`. 
  * _This will be your file format for all future labs._ 

* _The files you start with include problems in **bold text**.  Solve each problem, explain your work, and analyze your results. Leave the instructions in place to help organize your work and to help me know which problem you are solving._

* _Your code should be readable and should have occasional helpful comments._

* _Alternate your code with Markdown cells describing what you are doing and summarizing your results._

* _The starter files contain additional instructions in_ italics _that you should delete before submitting your solution._ 

* _Once you have completed the assignment, upload your file along with ALL data files that you import from a local directory. Your submission should include everything I need to recompile (run) it from start to finish. If any cell shows an error message or warning, you will receive a reduced score._

_Be sure to follow the instructions carefully for the labs. By following the instructions, you make it easier for me to assess your lab and get prompt feedback to everyone in time for your work on the subsequent lab. If you do not follow the instructions&mdash;even small things such as filenames or including data files&mdash;you will receive a reduced score._

# _Your work starts here. Good luck!_

_All of your files should start with a Markdown box with the following information on separate lines (you can make Markdown include linebreaks with two spaces at the end of the line, or you can use HTML tags if you prefer):_

your name (e.g., Yennefer of Vengerberg);  
name of your lab partner (e.g., Geralt of Rivia);  
course designation and semester (DSC 340A S25); and  
name of the assignment (e.g., Lab 1: Getting Started).

In [None]:
# Import any Python packages that you need here.
# All import statements should be in your first code cell. Example:
import math
import pandas as pd
from math import sin, asin

**Import [this data file](https://raw.githubusercontent.com/jasperdebie/VisInfo/master/us-state-capitals.csv) containing the U.S. states and capitals and their geolocation (latitude and longitude).**

Read the file directly from its URL with the pandas function `read_csv` and stored the resulting DataFrame in the variable `f_file`.

In [None]:
f_file = pd.read_csv('https://raw.githubusercontent.com/jasperdebie/VisInfo/master/us-state-capitals.csv')
f_file

**Look closely at your data.  Do you notice any problems?  If so, address them!**

Used the string method `str.replace` to remove the HTML `<br>` tags from the `description` column by replacing them with empty strings.

In [None]:
# Remove the literal '<br>' tags; regex=False treats the pattern as plain text
f_file['description'] = f_file['description'].str.replace('<br>', '', regex=False)
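A minimal sketch of how the cleanup can be checked, run on a small made-up frame rather than the real file. The column names match the data set, but the two sample rows and the helper name `count_tag_rows` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical two-row sample that mimics the stray '<br>' tags; not the real data
sample = pd.DataFrame({
    'name': ['Kentucky', 'Maine'],
    'description': ['Frankfort<br>', 'Augusta<br>'],
})

def count_tag_rows(frame, column, tag='<br>'):
    """Count the rows whose text in `column` contains the literal `tag`."""
    return int(frame[column].str.contains(tag, regex=False).sum())

before = count_tag_rows(sample, 'description')

# Remove the literal tag, then confirm that none are left
sample['description'] = sample['description'].str.replace('<br>', '', regex=False)
after = count_tag_rows(sample, 'description')

print(before, after)
```

Counting the affected rows before and after the replacement gives a quick confirmation that the fix touched every row it should have.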

**Sort the capitals from west to east.**

Sorted the dataset by longitude in ascending order. Because west longitudes are negative, ascending order runs from west to east.

In [None]:
f_file_sort = f_file.sort_values(by = 'longitude') # Ascending longitude = west to east

# Print each capital with its state, westernmost first
for capital, state in zip(f_file_sort['description'], f_file_sort['name']):
    print(capital + ', ' + state)
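One way to sanity-check the ordering is sketched below on a made-up three-row frame. The coordinates are rounded approximations chosen for illustration, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical three-row sample; longitudes are rounded approximations
sample = pd.DataFrame({
    'name': ['Maine', 'California', 'Kentucky'],
    'description': ['Augusta', 'Sacramento', 'Frankfort'],
    'longitude': [-69.78, -121.49, -84.88],
})

sample_sort = sample.sort_values(by='longitude')

# Ascending longitude means each capital lies east of the one before it
is_west_to_east = sample_sort['longitude'].is_monotonic_increasing

westernmost = sample_sort.iloc[0]['description']
easternmost = sample_sort.iloc[-1]['description']
print(is_west_to_east, westernmost, easternmost)
```

Looking at the first and last rows is a quick plausibility test: the westernmost entry should be a Pacific-side capital and the easternmost an Atlantic-side one.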

**Write a function, `distance(lat1, lon1, lat2, lon2)` that uses the [haversine formula](https://en.wikipedia.org/wiki/Haversine_formula) to estimate the distance between two locations (given as latitude and longitude, in _degrees_) on the earth's surface.**

_The haversine formula (see link above) can be used to derive a distance between two points on a sphere, as follows:_

$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos(\varphi_1) \cos(\varphi_2)\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$$

_In the formula above, $\varphi_1$ and $\lambda_1$ are the latitude and longitude of the first location *in radians*, and $\varphi_2$ and $\lambda_2$ are those of the second *in radians*.  The $r$ in the formula is the earth's radius&mdash;that is, the distance from the center of the earth to the surface.  The earth isn't perfectly spherical, but you can use $r \approx 3959$ miles as an approximation._

_The formula may seem a little intimidating, but it is straightforward to compute. No loops! Remember that you can import from the math module. You are also welcome to experiment with NumPy if you like. Also note that latitude and longitude use *degrees* as units, but the formula above requires *radians*.  Recall that $360^{\circ} = 2\pi$ radians. The Python3 `math` package provides a function `radians` that converts degrees to radians._
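As a small illustration of the unit conversion described above, the sketch below checks `math.radians` against the identity $360^{\circ} = 2\pi$. It assumes the usual convention that a west longitude is written as a negative number of degrees; the variable names are made up for the example.

```python
import math

# 360 degrees is a full circle, i.e. 2*pi radians
full_circle = math.radians(360)

# A west longitude such as 84.7722° W is written as a negative number of degrees
lon_rad = math.radians(-84.7722)

print(full_circle, lon_rad)
```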

_**Important: Yes, there are packages that you can import that will compute the Haversine formula for you.  However, for this assignment, you should write the code yourself.**_

The `distance` function takes four arguments: the latitude and longitude of location 1, followed by the latitude and longitude of location 2, all in degrees.
We assume the Earth is a sphere with a radius of approximately 3959 miles.
The function first converts the geographic coordinates from degrees to radians.
It then evaluates the haversine formula using functions from the `math` module.

In [None]:
def distance(lat1, lon1, lat2, lon2):
    """Return the haversine distance in miles between two points given in degrees."""
    r = 3959 # Approximate radius of the earth in miles

    # Convert deg to rad
    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)
    lon1 = math.radians(lon1)
    lon2 = math.radians(lon2)

    lat_diff = lat2 - lat1
    lon_diff = lon2 - lon1

    # Term under the square root in the haversine formula
    a = math.sin(lat_diff / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(lon_diff / 2) ** 2

    return 2 * r * math.asin(math.sqrt(a))

**Use a few examples to verify that your formula is correct.**
* _Here are some locations you can try:_
  * _Danville, Kentucky, USA (37.6456° N, 84.7722° W)_
  * _Frankfort, Kentucky, USA (38.1867° N, 84.8753° W)_
  * _Frankfurt, Germany (50.1109° N, 8.6821° E)_
* _The distance from Danville to Frankfort is about 38 miles._
* _The distance from Danville to Frankfurt is about 4,421 mi (&plusmn; 0.5%)._
* _The distance from any location to itself should be 0._

***Important Hint: Think about the meaning of a negative sign when representing geolocations as longitude and latitude.***
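The checks listed above could be run along the following lines. This is a sketch, not a required solution: `distance` is restated in compact form so the block runs on its own, and west longitudes are assumed to be entered as negative numbers.

```python
import math

def distance(lat1, lon1, lat2, lon2):
    """Haversine distance in miles; inputs in degrees (restated so the block runs alone)."""
    r = 3959  # Approximate radius of the earth in miles
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# West longitudes are negative, east longitudes are positive
danville = (37.6456, -84.7722)
frankfort = (38.1867, -84.8753)
frankfurt = (50.1109, 8.6821)

d_frankfort = distance(*danville, *frankfort)
d_frankfurt = distance(*danville, *frankfurt)
d_self = distance(*danville, *danville)

print(round(d_frankfort, 1), round(d_frankfurt, 1), d_self)
```

If the Frankfurt result is far from the expected value, the most likely cause is a sign error on one of the longitudes.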

**Use your function to compute the distance of every state capital from Danville, Kentucky.**

In [None]:
# Danville, Kentucky; the west longitude is entered as a negative number
danville_lat, danville_lon = 37.6456, -84.7722

# Distance from Danville to each capital, in row order
l_distances = []
for capital_lat, capital_lon in zip(f_file['latitude'], f_file['longitude']):
    l_distances.append(distance(danville_lat, danville_lon, capital_lat, capital_lon))

f_file['distance_from_danville'] = l_distances

print(f_file.head())
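For comparison, the same column can be computed without an explicit loop. The sketch below is one possible NumPy version, not the method used above: `distance_np` is a hypothetical name, and the two-row frame is made up (Frankfort's coordinates come from the verification examples, Sacramento's are rounded approximations).

```python
import numpy as np
import pandas as pd

def distance_np(lat1, lon1, lat2, lon2, r=3959):
    """Vectorized haversine in miles; hypothetical NumPy counterpart of `distance`."""
    lat1, lon1, lat2, lon2 = (np.radians(x) for x in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

# Hypothetical two-row sample standing in for the full data set
sample = pd.DataFrame({
    'description': ['Frankfort', 'Sacramento'],
    'latitude': [38.1867, 38.58],
    'longitude': [-84.8753, -121.49],
})

# One call computes the whole column; the scalars broadcast against the Series
sample['distance_from_danville'] = distance_np(
    37.6456, -84.7722, sample['latitude'], sample['longitude'])
print(sample)
```

The formula is unchanged; only the `math` functions are swapped for their NumPy equivalents, which accept whole columns at once.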

**Which two U.S. capitals are farthest apart? Which two are closest together?**

Using nested loops, we found the two farthest capitals to be **Augusta & Honolulu**, while the two closest capitals are **Boston & Providence**.

In [None]:
l2d_pairwise_distance = [] # Rows of [capital_1, capital_2, distance]

for i in range(len(f_file)): # First capital of the pair
    s_capital_1 = f_file.iloc[i]['description']
    lat_1 = f_file.iloc[i]['latitude']
    lon_1 = f_file.iloc[i]['longitude']
    for j in range(i + 1, len(f_file)): # Second capital; j > i avoids duplicate and self pairs
        s_capital_2 = f_file.iloc[j]['description']
        lat_2 = f_file.iloc[j]['latitude']
        lon_2 = f_file.iloc[j]['longitude']
        d = distance(lat_1, lon_1, lat_2, lon_2) # Utilize distance formula
        l2d_pairwise_distance.append([s_capital_1, s_capital_2, d]) # Update list

# Store in a dataframe for interpretability
df_pairwise_distance = pd.DataFrame(l2d_pairwise_distance, columns = ['capital_1', 'capital_2', 'distance'])

# Sort dataframe accordingly
sorted_df_pairwise_distance = df_pairwise_distance.sort_values(by = 'distance')
desc_df_pairwise_distance = df_pairwise_distance.sort_values(by = 'distance', ascending = False)

# Inspect results
print("The two farthest capitals are:")
print(desc_df_pairwise_distance.iloc[0])

print("\nThe two closest capitals are:")
print(sorted_df_pairwise_distance.iloc[0])
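The pairwise search can also be sketched with `itertools.combinations`, which yields each unordered pair exactly once. This is an alternative illustration, not the approach used above: `distance` is restated so the block runs alone, and the four capitals use rounded, approximate coordinates.

```python
import math
from itertools import combinations

def distance(lat1, lon1, lat2, lon2):
    """Haversine distance in miles; inputs in degrees (restated so the block runs alone)."""
    r = 3959  # Approximate radius of the earth in miles
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical sample of four capitals; coordinates are rounded approximations
capitals = {
    'Boston': (42.36, -71.06),
    'Providence': (41.82, -71.41),
    'Honolulu': (21.31, -157.86),
    'Augusta': (44.31, -69.78),
}

# Each unordered pair appears once, like the j > i inner loop
pairs = [(a, b, distance(*capitals[a], *capitals[b]))
         for a, b in combinations(capitals, 2)]

closest = min(pairs, key=lambda p: p[2])
farthest = max(pairs, key=lambda p: p[2])
print(closest[:2], farthest[:2])
```

With n capitals this examines n(n-1)/2 pairs, the same number the nested loops visit.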

## Conclusion

The math never lies. Whether you work computationally by sorting on longitude or exhaustively by calculating the geographic distance between every pair of capitals, you will arrive at the same results.

## Acknowledgements



## References

https://chat.deepseek.com/a/chat/s/a8cd54a3-f383-41dd-bd6c-26365640abad

_Check that you have followed all of the instructions and answered all the questions and that your report is in the specified format, alternating between exposition and code with output. Then upload your work to Moodle._