# Coursera Troubleshooting
* Amanda Ballas - Security Systems Admin.

## First Steps: How?
1) gather info
	- what, why, result, consequences
	- Reproduction case
		- recreate the problem to understand how and why

2) Find the root cause
3) Perform necessary remediation
	* maybe rebooting (workaround) (short-term remediation)
	
- check pc logs
- what was user doing, how long has it been going on

* short-term remediation: cleaning the fan
* long-term remediation: creating an IoT monitor that reminds you to clean it

### Remarks

- get info
- modify the program to produce more/better logs (debugging mode)
- isolate the causes
- understand the error messages 
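
The "debugging mode" remark above can be sketched with Python's standard logging module. This is only a minimal sketch under assumptions: the logger name `troubleshooting_demo`, the `parse_port` helper and the fallback port 8080 are hypothetical, picked for illustration.

```python
import logging

def configure_logging(debug=False):
    """Return a logger that emits DEBUG messages only when debug is True."""
    logger = logging.getLogger("troubleshooting_demo")  # hypothetical name
    logger.setLevel(logging.DEBUG if debug else logging.WARNING)
    if not logger.handlers:  # avoid adding a second handler on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger

def parse_port(value, logger):
    """Hypothetical helper: the debug line records the exact input received."""
    logger.debug("parse_port called with %r", value)
    try:
        return int(value)
    except ValueError:
        logger.warning("invalid port %r, falling back to 8080", value)
        return 8080

log = configure_logging(debug=True)   # switch debugging mode on
parse_port("80a", log)                # logs the input, then the warning
```

With `debug=False` only the warning is shown, so the extra detail costs nothing until you need it.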

# Technical Info
## W1 : Concepts

### Intermittent Bugs:
1) Heisenbugs (observer bugs)
    - the bug goes away when you observe it, but pops up again afterwards
2) Bugs that go away when restarting 
    * (restart = power cycling)

### How to search

1) Linear Search
	- 1 2 3 4 ...
	- it works but can take a lot of time in a huge set
	- time grows in proportion to the size of the set (t ~ len)
2) Binary Search
	- you keep halving the set (bisecting)
	- asking "is it on the left side, or right". 
	- e.g. for 100,000 items: about 17 comparisons, instead of up to 100,000
	- binary search requires the set to be sorted beforehand
	
	* git bisect
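
The bisecting idea above can be sketched in Python. This is a rough illustration of the approach, not how git bisect is implemented: the function name `find_first_bad`, the build numbers and the "bad from 73,512 onward" rule are all hypothetical. It assumes the items are ordered so that everything after the first bad item is also bad, and that at least one item is bad.

```python
def find_first_bad(items, is_bad):
    """Bisect an ordered sequence to find the first bad item.

    Returns (index_of_first_bad_item, number_of_checks).
    """
    low, high = 0, len(items) - 1
    checks = 0
    while low < high:
        mid = (low + high) // 2
        checks += 1
        if is_bad(items[mid]):
            high = mid        # first bad item is at mid or to its left
        else:
            low = mid + 1     # first bad item is to the right of mid
    return low, checks

builds = range(100_000)       # hypothetical build numbers
index, checks = find_first_bad(builds, lambda build: build >= 73_512)
print(index, checks)          # checks stays at 17 or fewer for 100,000 items
```

A linear search over the same builds would need tens of thousands of checks to reach the same answer.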

## W2 : Slowness

### Terms:
* Memory Leak
	- memory which is no longer needed is not getting released

* find root cause and bottleneck

* To look up elements by key
	- use dictionaries (much faster than searching through a list)
	- lists are fine when you access elements by position
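
A small sketch of the difference between the two lookups. The user data and function name are hypothetical; the point is only that the list has to be scanned item by item, while the dictionary goes straight to the key.

```python
# Hypothetical sample data: (username, email) pairs.
users = [
    ("alice", "alice@example.com"),
    ("bob", "bob@example.com"),
    ("carol", "carol@example.com"),
]

def email_from_list(pairs, name):
    """Linear scan: checks the pairs one by one until the name matches."""
    for user, email in pairs:
        if user == name:
            return email
    return None

# Build the dictionary once; each later lookup is a single hash lookup.
emails_by_name = dict(users)

print(email_from_list(users, "carol"))  # scans three entries
print(emails_by_name.get("carol"))      # goes straight to the key
print(users[0])                         # access by position is what lists do well
```
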

* time (command)
	- gives real, user, sys time.
	- real:
		- actual time
	- user:
		- user space
	- sys:
		- sys-level ops
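
The same three numbers can be approximated from inside Python, which helps show what each one means. This is a sketch, not a replacement for the time command; the `measure` helper is a hypothetical name.

```python
import os
import time

def measure(func):
    """Run func and report wall-clock, user-space and system CPU time."""
    start_wall = time.perf_counter()
    start_cpu = os.times()
    func()
    end_cpu = os.times()
    end_wall = time.perf_counter()
    return {
        "real": end_wall - start_wall,               # actual elapsed time
        "user": end_cpu.user - start_cpu.user,       # CPU time in user space
        "sys": end_cpu.system - start_cpu.system,    # CPU time in system calls
    }

# Sleeping takes real time but almost no CPU time, so real is much
# larger than user + sys: the program is waiting, not computing.
timings = measure(lambda: time.sleep(0.2))
print(timings)
```
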

* profilers
	- pprofile3
	- kcachegrind

* Concurrency
	- a dedicated field of CS 
	- on parallel ops

* parallel ops
	- >1 core
		- OS decides what gets executed on which core

* Threads
	- lets us run parallel tasks inside a process
	- Py: threading and asyncio modules

* Executor
	- process that's in charge of distributing the work among the different workers

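* Futures module (concurrent.futures)
	- Threads
		- have more safety measures to avoid writing to the same var
		- (slight delay from t1 to t2, waiting to make sure it's not the same var)
		- GIL limitations (Global Interpreter Lock): limits the utilization of multiple cores
	- Processes
		- can be a little faster than threads for CPU-bound work
		- each process has its own interpreter, so processes bypass GIL limitations
	- asyncio
		- a separate module, not part of futures
		- well suited to I/O-bound tasks; runs in a single thread, so it does not bypass the GIL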
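
The executor idea above can be sketched with concurrent.futures. The `fetch` function and host names are hypothetical stand-ins for real I/O-bound work; the sleep simulates waiting on a network or disk.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(name):
    """Hypothetical I/O-bound task; the sleep stands in for waiting."""
    time.sleep(0.1)
    return name.upper()

names = ["web01", "web02", "db01", "cache01"]   # hypothetical host names

# The executor hands each task to one of its worker threads.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fetch, names))  # results keep the input order

print(results)
```

For CPU-bound work, ProcessPoolExecutor from the same module offers the same interface but uses processes instead of threads.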

### A problem, hosting data.
* You host some data, and it keeps getting more popular.
* You start with a .csv file, as most people do.
* It gets slower and slower, so you have to switch.
* Solution? Move up one step at a time:
	- .csv 
	- SQLite
	- DB server
	- Dynamic cacher
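
The step from .csv to SQLite can be sketched with Python's built-in csv and sqlite3 modules. The table layout and sample rows are hypothetical, and an in-memory database is used here only so the example leaves no file behind.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice this would be read from a file.
CSV_TEXT = "name,department\nalice,ops\nbob,dev\ncarol,ops\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT PRIMARY KEY, department TEXT)")

# Load the CSV rows into the table once.
reader = csv.DictReader(io.StringIO(CSV_TEXT))
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(row["name"], row["department"]) for row in reader],
)

# Queries no longer re-read and re-parse the whole file.
rows = conn.execute(
    "SELECT name FROM employees WHERE department = ? ORDER BY name", ("ops",)
).fetchall()
print(rows)
```
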

### Terms and Definitions 
- (from Course 4, Module 2)

* Activity Monitor: 
	- macOS tool that shows what's using the most CPU, memory, energy, disk, or network
 
* Cache: 
	- This stores data in a form that's faster to access than its original form
 
* Executor: 
	- This is the process that's in charge of distributing the work among the different workers
 
* Expensive actions: 
	- Actions that can take a long time to complete
 
* Futures: 
	- A module that provides a couple of different executors, one for using threads and the other one for using processes

* Lists: 
	- Sequences of elements
 
* Memory leak: 
	- This happens when a chunk of memory that's no longer needed is not released
 
* Profiler: 
	- A tool that measures the resources the code is using to see how the memory is allocated and how the time is spent
 
* Real time: 
	- The amount of actual time that it took to execute the command
 
* Resource Monitor (or Performance Monitor): 
	- Windows OS tool that shows what's using the most CPU, memory, energy, disk, or network

* Sys time: 
	- The time spent doing system level operations
 
* Threads: 
	- Run parallel tasks inside a process
 
* User time: 
	- The time spent doing operations in the user space
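
The "Cache" and "Expensive actions" entries above can be illustrated with functools.lru_cache from the standard library. This is one possible sketch; `expensive_lookup` is a hypothetical function and the sleep stands in for slow work.

```python
import time
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_lookup(key):
    """Hypothetical expensive action; the sleep stands in for slow work."""
    time.sleep(0.05)
    return key.upper()

expensive_lookup("disk")    # slow: computed, then stored in the cache
expensive_lookup("disk")    # fast: answered from the cache
print(expensive_lookup.cache_info())   # reports hits and misses so far
```
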

## W3 : Crashing Programs

### General Info:
- to find the root cause of a crash, look at:
	- logs
	- changes/new versions (change history in VCS maybe?)
	- trace sys/library calls
	- create reproduction case (as small as possible!)

* Watchdog:
	- a process that checks whether a program is running, and when it's not, starts the program again.
		- not running -> restart
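
A watchdog like the one described above can be sketched with subprocess. This is a simplified illustration: the `watchdog` function and its parameters are hypothetical, and the restart cap exists only so the example terminates. A real watchdog would normally keep running.

```python
import subprocess
import sys
import time

def watchdog(command, max_restarts=3, poll_interval=0.1):
    """Start command and restart it whenever it is found not running.

    Returns the number of restarts performed once the cap is reached.
    """
    restarts = 0
    process = subprocess.Popen(command)
    while True:
        if process.poll() is None:      # still running: check again later
            time.sleep(poll_interval)
            continue
        if restarts >= max_restarts:    # cap reached: stop the demo
            return restarts
        restarts += 1
        process = subprocess.Popen(command)   # not running -> restart

# Hypothetical program that exits with an error right away.
crashing_program = [sys.executable, "-c", "raise SystemExit(1)"]
print(watchdog(crashing_program, max_restarts=3))
```
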

* Answer these questions, when reporting a bug:
	- What were you trying to do?
	- What were the steps you followed?
	- What did you expect to happen?
	- What was the actual outcome?
	* Also, try to create a reproduction case and submit it as well. It helps a lot!

* Blue screen of death (BSoD)
* Event Tracing for Windows (ETW)

### Linux Troubleshooting Methods:
#### Some Commands:
* strace 
	- to trace sys calls/signals.
* ptrace 
* netstat 
	- gives a bunch of network stats

#### Some Directories
* /var/log 
* /etc/<app folder> 
	- contains config files
* /srv/<app folder>
	- stores data for services&apps
	- e.g.
		- "example.com" web content (Saved)
		- .py files for the service

#### Some Tools
* memtest86
	- to check for health of RAM

### Mac Troubleshooting Methods:
* dtruss (cmd)
* Console (app)

### Windows Troubleshooting Methods:
* process monitor (app)
* Event Viewer (app)




# Tips & Hazards
* Always check your hypothesis in a TEST env.


### Some Functions
- strace : traces system calls
	- -o (flag): saves the output to a file [strace -o fail.strace ./script.py]
		- then you can use "less fail.strace" to browse through it
	- "strace ./script.py 2>&1 | less" works too (strace writes its output to stderr)

- top
	- shows CPU loads
- iotop
    - shows I/O loads (input and output)
- iostat
    - shows statistics on I/O loads
- vmstat
    - shows stats on VM ops (virtual memory) 
- ionice
    - reduces priority to access the disk (nice I/O)
- nice
    - reduces priority to access the CPU

- iftop
    - shows network traffic
- rsync
    - used to back up data (sync)
    - has an option to limit bandwidth (--bwlimit)
    - or can use a tool like Trickle to limit bandwidth
- kill -STOP 
    - suspends the process (pause); kill -CONT resumes it
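
kill -STOP pauses a process rather than ending it, and kill -CONT lets it continue. The same signals can be sent from Python, as in this minimal sketch. It assumes a Unix-like system (these signals are not available on Windows), and the sleeping child process is a hypothetical stand-in for a real program.

```python
import signal
import subprocess
import sys

# Start a hypothetical long-running child process to act on.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

child.send_signal(signal.SIGSTOP)   # like: kill -STOP <pid>  (process pauses)
child.send_signal(signal.SIGCONT)   # like: kill -CONT <pid>  (process resumes)

child.kill()                        # clean up the demo process
child.wait()
```
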

### Logs. Which logs to read, where, how?
- Linux:
	- /var/log/syslog
	- ~/.xsession-errors
- macOS:
	- system logs
	- /Library/Logs
- Windows:
	- "Event Viewer" tool